feat(spu): in-process EmbedderClient backend (PR-1.F) #294
Phase 1 of the embedder cutover per docs/specs/weaver-spu-Spec.md §10: introduce the in-process candle-backed embedder backend. No consumer is wired up yet — that's the follow-up, split into:

- PR-A (this): land the `EmbedderClient` + supporting plumbing
- PR-B: backend selection in serve.rs (boot-time pin probe path)
- PR-C: agent-side runtime cutover (herobench/benchmark.rs)
- PR-D: retire grpc_client_legacy + the Python embedder service

## What's in this PR

- `crates/weaver-spu/src/encoder/client.rs` (NEW): `EmbedderClient` struct holding `JinaV4Embedder` behind `Arc<parking_lot::Mutex<…>>`. Implements `weaver_core::embedder::Embedder`. Forward passes run under `tokio::task::spawn_blocking` (the GPU forward is sync and seconds-long; blocking on the runtime would starve other tasks). `parking_lot::Mutex` rather than `std::sync::Mutex` so a panic in `encode_text` doesn't poison the slot for the daemon's lifetime.
- `embed_late_chunked` returns `EmbeddingError::NotAvailable` for now. The token-level forward surface needed to drive `late_chunk.rs` from the in-process path hasn't landed yet; failing loud is preferable to silently degrading to single-vector pooling. Consumers needing late chunking keep `backend = "python"` until the surface lands.
- `encoder/mod.rs`: wire the new `client` module under `cfg(feature = "candle")`, consistent with the rest of the candle encoder modules.
- `encoder/jina_v4.rs`: add a `pub fn max_seq_len()` getter so `EmbedderClient` can populate `EmbedderInfo::max_seq_length` for the cohort-pin guard.
- `encoder/qwen25vl_loraed.rs`: a pre-existing clippy lint (`run_attention_post_proj` reaches 9 args under `feature = "debug-layers"`) passed on default features but blocked `cargo clippy --features flash-attn --lib -- -D warnings`. Added `#[allow(clippy::too_many_arguments)]` — splitting the function for a cfg-gated, diagnostic-only param would be worse.
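The shape of the `client.rs` wrapper can be sketched std-only — this is a rough illustration, not the real implementation: `std::sync::Mutex` and a plain thread stand in for `parking_lot::Mutex` and `tokio::task::spawn_blocking`, and `MockModel` is a hypothetical stand-in for `JinaV4Embedder`:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for JinaV4Embedder: a sync, seconds-long GPU forward
// that needs `&mut self` (KV-cache state in the real model).
struct MockModel;

impl MockModel {
    fn encode_text(&mut self, _text: &str) -> Vec<f32> {
        vec![0.0; 2048] // pretend full-dim embedding
    }
}

// Sketch of the EmbedderClient shape: a shared, mutex-serialized model handle.
// The Embedder trait hands us `&self`, so interior mutability bridges to the
// model's `&mut` forward pass.
#[derive(Clone)]
struct EmbedderClient {
    model: Arc<Mutex<MockModel>>,
}

impl EmbedderClient {
    fn embed(&self, text: String) -> Vec<f32> {
        let model = Arc::clone(&self.model);
        // Stand-in for tokio::task::spawn_blocking: run the sync forward off
        // the caller's thread so an async runtime wouldn't be starved.
        thread::spawn(move || {
            let mut guard = model.lock().expect("model mutex poisoned");
            guard.encode_text(&text)
        })
        .join()
        .expect("forward pass panicked")
    }
}
```

One note on the std stand-in: with `std::sync::Mutex`, a panic inside `encode_text` would poison the lock for every later caller — which is exactly the failure mode the real code avoids by using `parking_lot::Mutex`.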
## Why in-process

Per the project mantra "latency is the enemy of agency": every cross-process hop on every memory read is dead cost. The SPU is one unit (encoder + decoder co-resident in `weaver-infer`); if one half crashes, the whole stack should die. That's the desired failure mode — no silent fallback to a wedged embedder.

## Validation

- `cargo check -p weaver-spu --features flash-attn --lib` — clean
- `cargo clippy -p weaver-spu --features flash-attn --lib -- -D warnings` — clean
- `cargo check -p weaver-spu --features flash-attn --all-targets` — clean (verifies tests + binaries still compile against the lib changes)

End-to-end smoke validation is deferred to PR-B (boot-time pin probe path), where the daemon constructs an `EmbedderClient::from_snapshot` and the cohort-pin guard reads its `info()`.

## Design notes

- `weight_revision` is derived from the snapshot dir basename (the HF cache layout puts the SHA in the `snapshots/<sha>/` directory). Empty string when the path has no usable basename.
- `dimension` is hardcoded to `matryoshka::FULL_DIM` (2048). Matryoshka truncation isn't surfaced through the trait yet; it would become a per-call parameter when callers need it.
- `health_check()` overrides the trait default — it always returns `true` once construction succeeds, since the model is mmapped for the daemon's lifetime and there's no runtime path that flips `model_loaded` back to false.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Actionable comments posted: 1
Inline comments:
In `crates/weaver-spu/src/encoder/client.rs`:
- Around lines 165-167: the current `.await.map_err(...)` chain around the `spawn_blocking` result converts panics into `EmbeddingError::Transport` (wrapping `IoError`) and keeps the daemon running. Change the handling of the `JoinError` returned from awaiting the `spawn_blocking` `JoinHandle`: when `JoinError::is_panic()` is true, call `std::panic::resume_unwind(join_err.into_panic())` to re-propagate the panic; otherwise, map non-panic join errors into `EmbeddingError::Transport` (wrapping `IoError`) as before.
📒 Files selected for processing (4)

- crates/weaver-spu/src/encoder/client.rs
- crates/weaver-spu/src/encoder/jina_v4.rs
- crates/weaver-spu/src/encoder/mod.rs
- crates/weaver-spu/src/encoder/qwen25vl_loraed.rs
…review) CR finding: turning a panic (`JoinError::is_panic()`) from `tokio::task::spawn_blocking` into an `EmbeddingError::Transport` masks bugs as transient transport failures. Per the canonical Tokio pattern (see the `tokio::task::JoinError` docs), `resume_unwind(join_error.into_panic())` is the right propagation when `is_panic()` is true, so downstream error handling sees a real panic vs a real error. Cancellation isn't a path we exercise (we don't cancel the spawn_blocking handle), but if it ever happens, treat it as a transport error so the caller sees something rather than nothing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
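The propagation pattern this fix adopts can be sketched with std threads — `join_or_resume` is a hypothetical helper, and std's `thread::Result` stands in for tokio's `JoinError`:

```rust
use std::panic;
use std::thread;

// Fail-loud join sketch: a panic inside the blocking task is re-raised via
// resume_unwind instead of being wrapped as an ordinary error value, so
// callers see a real panic rather than a fake transport failure. The real
// code branches on tokio's JoinError::is_panic() / into_panic(); std's
// thread::Result plays that role here.
fn join_or_resume<T>(res: thread::Result<T>) -> T {
    match res {
        Ok(value) => value,
        Err(payload) => panic::resume_unwind(payload),
    }
}
```

In the tokio version the non-panic branch is where the `EmbeddingError::Transport` mapping stays; only the panic branch changes.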
…capped value (PR #297 review) The getter's rustdoc said it returned the architectural ceiling read from `config.json::max_position_embeddings`. That was correct when the getter was added in PR #294 but became stale after the cap landed earlier in this PR. Now states explicitly that it returns `min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings)` and references both terms so a reader can see the relationship between the cap, the raw config field, and the value `EmbedderInfo` exposes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
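The relationship the updated rustdoc spells out is just a `min` over the two sources. A sketch with placeholder numbers — the constant's value here is illustrative only, not the real `JINA_V4_TRAINED_CONTEXT`, and `Config` is a hypothetical stand-in for the parsed config:

```rust
// Illustrative placeholder value — NOT the real trained-context constant.
const JINA_V4_TRAINED_CONTEXT: usize = 32_768;

struct Config {
    // Architectural ceiling as read from config.json::max_position_embeddings.
    max_position_embeddings: usize,
}

// What the updated rustdoc documents: the exposed max_seq_len is
// min(JINA_V4_TRAINED_CONTEXT, raw.max_position_embeddings).
fn max_seq_len(cfg: &Config) -> usize {
    JINA_V4_TRAINED_CONTEXT.min(cfg.max_position_embeddings)
}
```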
Design notes
Concurrency model: `encode_text` needs `&mut self` (KV cache state); the trait surfaces `&self`. Wrapped behind `Arc<parking_lot::Mutex>`. Forward passes run under `tokio::task::spawn_blocking` since they're sync + GPU-bound (seconds-long; would starve the runtime if blocking inline). `parking_lot::Mutex` rather than `std::sync::Mutex` so a panic in `encode_text` doesn't poison the slot for the daemon's lifetime.
Late-chunked: returns `EmbeddingError::NotAvailable`. The token-level forward surface needed to drive `late_chunk.rs` from the in-process path hasn't landed yet; failing loud is preferable to silently degrading to single-vector pooling. Consumers needing late chunking keep `backend = "python"` until the surface lands (tracked separately).
`weight_revision`: derived from snapshot dir basename (HF cache layout puts the SHA in `snapshots/<sha>/`). Empty string when the path has no usable basename. Read by the cohort-pin guard for identity-drift detection.
`health_check()` override: always `true` once construction succeeds. The model is mmapped for the daemon's lifetime; no runtime path can flip `model_loaded` back to false. Avoids the round-trip through `info()` of the trait default.
Clippy fix on qwen25vl_loraed.rs: `run_attention_post_proj` reaches 9 args under `feature = "debug-layers"`. Lint passed on default-features but blocked `cargo clippy --features flash-attn --lib -- -D warnings`. Splitting the function for a cfg-gated diagnostic-only param would be worse than the suppression.
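The `weight_revision` derivation above is small enough to sketch directly — a hypothetical free function for illustration (in the real code this logic runs during `EmbedderClient` construction):

```rust
use std::path::Path;

// Derive weight_revision from the HF cache layout: the snapshot SHA is the
// directory basename under snapshots/. Falls back to "" when the path
// yields no usable basename.
fn weight_revision(snapshot_dir: &Path) -> String {
    snapshot_dir
        .file_name()
        .and_then(|name| name.to_str())
        .unwrap_or("")
        .to_string()
}
```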