toddwbucy · May 12, 2026
diff --git a/docs/audit/embedder-swap-2026-05-12.md b/docs/audit/embedder-swap-2026-05-12.md
@@ -0,0 +1,168 @@
+# Embedder swap audit — Jina V4 → Qwen3-Embedding-0.6B-GGUF (Q8_0)
+
+**Date:** 2026-05-12
+**Status:** Audit (PR 1 of 2; implementation follows in PR 2)
+**Related memory:** `project_embedder_per_profile`, `project_jina_v4_mandatory` (SUPERSEDED)
+**Verification status:** llama.cpp + Qwen3-Embedding-0.6B-Q8_0 GGUF compatibility confirmed via `llama-cpp-python` smoke test 2026-05-12 morning — 1024-dim embeddings, last-token pooling, correct retrieval ranking on a HeroBench fixture query.
+
+---
+
+## 1. Motivation (one-line)
+
+Jina V4 (3.7B params, 2048 dim, ~7.4GB VRAM) is overspecified for the current development phase. Switching to Qwen3-Embedding-0.6B-Q8_0 GGUF (595M params, 1024 dim, ~635MB VRAM) preserves 32K context (so late-chunking is unchanged), aligns the encoder tokenizer with the Qwen3-Coder decoder, and cuts the parameter count ~6× (VRAM ~12×). The swap also exercises the substrate-pluggability commitment in `memory-substrate-Spec.md` §3.5.
+
+## 2. Architectural decisions
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Dev-default embedder for Qwen3-family agents | **Qwen3-Embedding-0.6B-Q8_0 GGUF** | Tokenizer alignment with decoder; 32K context preserves late-chunking; ~6× lighter than Jina V4 |
+| Dev-default for non-Qwen3 (Gemma) agents | nomic-embed-text-v1.5 | Neutral choice; no first-party Gemma embedder exists |
+| Serving stack | **llama.cpp via `llama-cpp-sys-2`** (existing dep) | Same C++ engine that serves the decoder; no Python in runtime; in-process per-agent matches `architecture_unified_spu` |
+| Candle fork | **Unchanged** | Stays in place for the Jina V4 backend (one valid backend among several under per-profile pluggability); no fork updates needed for this swap |
+| Vector index dimension | **Profile-driven, not const** | Current `BELIEF_EMBEDDING_INDEX.dimension = 2048` const must become builder-constructed from agent's spu config |
+| Existing dev-DB content | **Rebuild required** | Dimension change 2048 → 1024 invalidates the HNSW index; all current dev agents are throwaway per the brief, so no migration path needed |
+
+## 3. Reference-count summary
+
+```
+jina-v4 / jina_v4 / JinaV4 references:           432  (most legitimate — see §4)
+2048 dimension references (in .rs files):         62  (~12 are framework leaks — see §5)
+EmbedderClient construction sites:                11  (need llama.cpp sibling — see §6)
+Agent YAML files with spu.encoder declarations:   20  (config update — see §7)
+Test fixtures asserting Jina specifics:           77  (update or remove — see §5.6)
+```
+
+## 4. Legitimate references (stay as-is)
+
+Per the brief: "No hardcoded references to Jina V4 model identifier or 2048 dimension remain **outside** weaver-spu (the embedder implementation layer)."
+
+These directories are the embedder implementation layer — references stay:
+
+| Location | Count | Why it stays |
+|---|---|---|
+| `crates/weaver-spu/src/encoder/` | 77 | The candle-Jina V4 backend implementation. Continues to exist as one valid backend in the per-profile world. |
+| `crates/weaver-spu/src/models/` | 22 | Model definitions for the candle backend. |
+| `crates/weaver-spu/src/decoder/` (50/-5) | 45 | Some of these are `multi_model.rs` server-config registrations of Jina V4 as one valid model — NOT decoder-side dependencies on Jina V4. See §5 for the few that are actual leaks. |
+| `crates/weaver-spu/src/core/` | 25 | Internal supporting code for the encoder + decoder paths. Includes `probe.rs` which queries embedder dimension dynamically (legitimate). |
+| `crates/weaver-spu/tests/` | 50 | Tests of the Jina V4 backend. Backend remains; tests remain. |
+| `crates/weaver-spu/src/bin/jina_embed.rs` | (small) | Standalone Jina-specific binary; can stay as a Jina-specific tool. |
+| `docs/specs/` | 74 | Documentation referencing Jina V4 as the historical default and one valid backend. Spec text update is a separate documentation PR; not blocking the implementation. |
+
+## 5. Framework leaks (must fix in PR 2)
+
+These are the actual violations of the "no model-specific assumptions outside the embedder implementation layer" principle:
+
+### 5.1 `crates/weaver-database/src/graph/belief.rs:114-120` — **HIGHEST PRIORITY**
+
+```rust
+pub const BELIEF_EMBEDDING_INDEX: BeliefEmbeddingIndexDef = BeliefEmbeddingIndexDef {
+    name: "embedding_hnsw",
+    field: BELIEF_EMBEDDING_FIELD,
+    dimension: 2048,    // ← hardcoded; framework leak
+    metric: "cosine",
+    n_lists: 64,
+};
+```
+
+**Fix:** replace the const with a builder function `belief_embedding_index_for_dim(dim: usize) -> BeliefEmbeddingIndexDef`. Callers in `ensure_belief_indexes` (and equivalents) take the agent's configured dim from the spu config and construct the index def at provisioning time.
+
+Companion changes in this file (lines 47, 66, 99, 287, 402, 454, 539): doc comments and test fixtures referencing "2048-dim Jina V4 embedding" — update text to "embedder-configured dim" and parameterize the test vector lengths.
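+
+A minimal sketch of the builder described in the fix above. The struct field types and the value of `BELIEF_EMBEDDING_FIELD` are assumptions for illustration (the audit quotes only the const initializer), and `n_lists: 64` is carried over unchanged from the existing const.
+
+```rust
+/// Placeholder value; the real `BELIEF_EMBEDDING_FIELD` is defined in belief.rs
+/// and is not quoted in this audit.
+pub const BELIEF_EMBEDDING_FIELD: &str = "embedding";
+
+/// Field types are assumed; only the field names come from the quoted const.
+#[derive(Debug, Clone, PartialEq)]
+pub struct BeliefEmbeddingIndexDef {
+    pub name: &'static str,
+    pub field: &'static str,
+    pub dimension: usize,
+    pub metric: &'static str,
+    pub n_lists: u32,
+}
+
+/// Builds the index definition for whatever dim the agent's embedder reports,
+/// instead of baking 2048 into a const.
+pub fn belief_embedding_index_for_dim(dim: usize) -> BeliefEmbeddingIndexDef {
+    BeliefEmbeddingIndexDef {
+        name: "embedding_hnsw",
+        field: BELIEF_EMBEDDING_FIELD,
+        dimension: dim,
+        metric: "cosine",
+        n_lists: 64,
+    }
+}
+```
+
+A provisioning-time caller would pass the dim resolved from the agent's spu config, for example `belief_embedding_index_for_dim(1024)` for the Qwen3-Embedding profile and `belief_embedding_index_for_dim(2048)` for a Jina V4 profile.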
+
+### 5.2 `crates/weaver-interface/src/harness.rs:207-250`
+
+Test fixtures asserting `model: "jina-v4"`, `dimension: 2048`. These test that the harness initializes correctly with whatever embedder is configured, so they should assert against the runtime's reported model/dim rather than against specific model/dim values.
+
+**Fix:** restructure the assertions to pin "model and dim match the configured embedder" rather than "model is jina-v4 and dim is 2048".
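+
+One possible shape for the restructured check, assuming a hypothetical `EmbedderIdentity` struct and `check_matches_configured` helper (neither name exists in the codebase today; the real harness types may differ):
+
+```rust
+/// Hypothetical pair of values the harness compares; not an existing type.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct EmbedderIdentity {
+    pub model: String,
+    pub dimension: usize,
+}
+
+/// Returns Ok when the runtime reports the same model and dim the agent was
+/// configured with, and a descriptive error otherwise.
+pub fn check_matches_configured(
+    configured: &EmbedderIdentity,
+    reported: &EmbedderIdentity,
+) -> Result<(), String> {
+    if configured.model != reported.model {
+        return Err(format!(
+            "model mismatch: configured {}, runtime reported {}",
+            configured.model, reported.model
+        ));
+    }
+    if configured.dimension != reported.dimension {
+        return Err(format!(
+            "dimension mismatch: configured {}, runtime reported {}",
+            configured.dimension, reported.dimension
+        ));
+    }
+    Ok(())
+}
+```
+
+With this shape the same test body covers both the Jina V4 and the Qwen3-Embedding profiles, because the expected values come from the fixture's config rather than from literals in the assertion.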
+
+### 5.3 `crates/weaver-interface/src/toml_edit.rs` (2 hits) + `crates/weaver-interface/src/serve.rs` (2 hits)
+
+Config-writing paths that emit `embedder.snapshot` referencing Jina V4 weights. Update default template to point at Qwen3-Embedding-0.6B-Q8_0 GGUF path; keep field substrate-agnostic.
+
+### 5.4 `crates/weaver-database/src/graph/runtime_schema.rs` (3 hits)
+
+Schema-side references to 2048. Same fix pattern as §5.1: parameterize on the runtime-resolved dim.
+
+### 5.5 `crates/weaver-database/src/graph/episode.rs` (2 hits)
+
+Episode-graph schema. Same parameterization pattern.
+
+### 5.6 Test fixtures (high-priority subset)
+
+| File | Action |
+|---|---|
+| `crates/weaver-database/tests/config_integration.rs` (2 hits) | Update fixtures to use new default; relax `dim == 2048` assertion to "dim matches the configured embedder" |
+| `crates/weaver-core/tests/reconciliation_prompt_corpus.rs` (2 hits) | Same treatment |
+| Other test files cited in §3 | Per-file review during PR 2 |
+
+## 6. EmbedderClient construction sites
+
+The current `EmbedderClient` in `weaver-spu/src/encoder/client.rs` is candle-based and Jina V4–specific. PR 2 adds a sibling for the GGUF/llama.cpp path:
+
+| Call site | File | Action |
+|---|---|---|
+| Daemon load handler | `crates/weaver-interface/src/server.rs:975, +3 more` | Dispatch on agent's spu config: Jina V4 profile → candle path; Qwen3-Embedding profile → llama.cpp path |
+| Server entrypoint | `crates/weaver-interface/src/serve.rs:211, +1` | Same dispatch |
+| Bench fixtures | `crates/weaver-demo/src/herobench/benchmark.rs:1078, +1` + `crates/weaver-demo/tests/herobench_integration.rs:2512, +1` | Update to use new default; remain Jina-compatible for backward profile testing |
+| Trait impl declaration | `crates/weaver-spu/src/encoder/client.rs:131` | Stays — sibling `LlamaCppEmbedderClient` added alongside |
+
+**Approach:** introduce a `LlamaCppEmbedderClient` in `weaver-spu/src/encoder/llama_cpp_client.rs` implementing the same `Embedder` trait. The daemon's load procedure picks which one to construct based on the agent's `spu.encoder.model_id` (or backend hint). Both coexist; per-profile pluggability is the dispatcher.
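+
+The dispatch can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: the `Embedder` trait here is a two-method stand-in (the real trait's methods are not shown in this audit), `CandleJinaClient` stands in for the existing `EmbedderClient`, `EncoderConfig` and `build_embedder` are hypothetical names, and the clients only record their paths rather than loading any model.
+
+```rust
+use std::path::PathBuf;
+
+/// Stand-in for the existing `Embedder` trait (assumed shape).
+pub trait Embedder {
+    fn dimension(&self) -> usize;
+    fn backend_name(&self) -> &'static str;
+}
+
+/// Stand-in for the existing candle-based, Jina V4-specific client.
+#[allow(dead_code)]
+pub struct CandleJinaClient {
+    snapshot: PathBuf,
+}
+
+/// Stand-in for the new sibling; the real one would wrap llama.cpp.
+#[allow(dead_code)]
+pub struct LlamaCppEmbedderClient {
+    gguf_path: PathBuf,
+}
+
+impl Embedder for CandleJinaClient {
+    fn dimension(&self) -> usize { 2048 }
+    fn backend_name(&self) -> &'static str { "candle" }
+}
+
+impl Embedder for LlamaCppEmbedderClient {
+    fn dimension(&self) -> usize { 1024 }
+    fn backend_name(&self) -> &'static str { "llama.cpp" }
+}
+
+/// Hypothetical view of the `spu.encoder` section of an agent YAML.
+pub struct EncoderConfig {
+    pub model_id: String,
+    pub model_path: PathBuf,
+}
+
+/// Picks the backend from `spu.encoder.model_id`; unknown ids are rejected
+/// rather than silently falling back to a default.
+pub fn build_embedder(cfg: &EncoderConfig) -> Result<Box<dyn Embedder>, String> {
+    match cfg.model_id.as_str() {
+        "jina-v4" => Ok(Box::new(CandleJinaClient {
+            snapshot: cfg.model_path.clone(),
+        })),
+        "qwen3-embedding-0.6b-gguf" => Ok(Box::new(LlamaCppEmbedderClient {
+            gguf_path: cfg.model_path.clone(),
+        })),
+        other => Err(format!("unknown spu.encoder.model_id: {}", other)),
+    }
+}
+```
+
+Returning `Box<dyn Embedder>` keeps the call sites in the table above backend-agnostic: they ask the returned client for its dimension instead of assuming 2048.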
+
+## 7. Agent YAML templates
+
+20 YAML files reference `spu.encoder`. For PR 2, only the following entries matter operationally:
+
+| File | Action |
+|---|---|
+| `agents/herobench-benchero-*.yaml` (existing bench cohort) | Leave alone — these are throwaway artifacts from earlier runs. No re-embedding migration. |
+| New default template (TBD) | Reference `Qwen3-Embedding-0.6B-Q8_0.gguf` path, `model_id: "qwen3-embedding-0.6b-gguf"`, `context_size: 32768` (native 32K, per §10 decision 1). |
+| `test_agent_qwen3` (new, for tomorrow's smoke test) | New file using the default template above. |
+
+## 8. Documentation updates (best-effort, separate from blocking work)
+
+| Spec | Action |
+|---|---|
+| `belief-nodes-embedding-Spec.md` §2.1, §3.5 | Decision log entry; update example dim values; update "Jina V4 → 2048" to "embedder-configured" |
+| `memory-substrate-Spec.md` §3.5, §4.2 | Decision log entry |
+| `embedder-service.md` | Update default-embedder reference |
+
+These can land in PR 2 or a follow-up doc-only PR — not blocking the implementation.
+
+## 9. Deferred items (logged, not in this PR)
+
+Per the brief: "If something tempting to refactor surfaces during the audit, log it for a follow-up rather than absorbing it into this change."
+
+| Item | Why deferred |
+|---|---|
+| candle-native Qwen3-Embedding implementation | Out of scope; llama.cpp via `llama-cpp-sys-2` is sufficient and ships fast |
+| BERT routing model integration | Out of scope per brief |
+| nomic-embed implementation for Gemma agents | Out of scope for PR 2 (test_agent2 is Qwen3 anyway); add when Gemma agent provisioning is needed |
+| `weaver-embedding` crate cleanup (legacy gRPC client artifacts) | Cosmetic; can wait |
+| Spec text updates beyond decision-log entries | Doc-only follow-up |
+
+## 10. Decisions (locked 2026-05-12 morning)
+
+1. **Context size: 32K** (native). Don't truncate artificially even though our current content fits in 8K — preserving native context will matter later for late-chunking workloads and long-document retrieval. Agent YAML defaults to `context_size: 32768`.
+
+2. **GGUF install path**: `/opt/weaver/models/qwen3-embedding-0.6b/Qwen3-Embedding-0.6B-Q8_0.gguf` — mirrors the existing `/opt/weaver/models/jina-embeddings-v4-*.gguf` convention in `multi_model.rs:1332`. PR 2 includes the operator step of moving the file from the current HF-cache location at `/opt/weaver/huggingface/hub/gguf-models/qwen3-embedding-0.6B/` (download artifact) to the canonical install path (production location).
+
+3. **Encoder/decoder GPU placement**: same-GPU per agent as the default (Plan A — `architecture_unified_spu`), **configurable** via `spu.encoder.gpu` and `spu.decoder.gpu` independent fields. No code-side hard-coding of co-residency; the daemon's load procedure honors whatever the agent YAML declares. Plan B (split-GPU, one agent at a time) is operator-selectable per agent without code changes.
+
+## 11. PR 2 implementation outline
+
+With the decisions in §10 locked, PR 2 ships in this order:
+
+1. **`LlamaCppEmbedderClient`** in `weaver-spu/src/encoder/llama_cpp_client.rs`. Implements `Embedder` trait. Uses `llama-cpp-sys-2`. Same shape as existing `EmbedderClient` but constructed from a GGUF path instead of a candle snapshot.
+2. **Backend dispatch** in `server.rs::load_per_agent_embedder`: switch on agent's `spu.encoder.model_id` / backend hint. Constructs the right client.
+3. **Schema-side dim parameterization**: replace `BELIEF_EMBEDDING_INDEX` const with builder. Update callers.
+4. **Default agent YAML template**: new `agents/_default-qwen3.yaml` (or update existing template).
+5. **Test fixture updates**: `config_integration.rs`, `reconciliation_prompt_corpus.rs`, `harness.rs`, others.
+6. **Decision log entries** in `belief-nodes-embedding-Spec.md` + `memory-substrate-Spec.md`.
+7. **Rebuild doc**: brief note explaining how to migrate an existing dev agent (destroy + recreate; no in-place re-embedding).
+
+## 12. Success criteria (mirrors brief)
+
+- [ ] All currently running tests pass (or are appropriately updated)
+- [ ] A new development agent can be provisioned with the new embedder, exercised through a HeroBench session, and produce expected memory writes and retrievals
+- [ ] No hardcoded references to Jina V4 model identifier or 2048 dimension remain outside `weaver-spu` (the embedder implementation layer)
+- [ ] Documentation reflects the new development-default embedder
+- [ ] Decision log entry in `belief-nodes-embedding-Spec.md` noting rationale, scope, and deferred items