168 changes: 168 additions & 0 deletions docs/audit/embedder-swap-2026-05-12.md
@@ -0,0 +1,168 @@
# Embedder swap audit — Jina V4 → Qwen3-Embedding-0.6B-GGUF (Q8_0)

**Date:** 2026-05-12
**Status:** Audit (PR 1 of 2; implementation follows in PR 2)
**Related memory:** `project_embedder_per_profile`, `project_jina_v4_mandatory` (SUPERSEDED)
**Verification status:** llama.cpp + Qwen3-Embedding-0.6B-Q8_0 GGUF compatibility confirmed via `llama-cpp-python` smoke test 2026-05-12 morning — 1024-dim embeddings, last-token pooling, correct retrieval ranking on a HeroBench fixture query.

---

## 1. Motivation (one-line)

Jina V4 (3.7B params, 2048 dim, ~7.4GB VRAM) is overspecified for the current development phase. Switching to Qwen3-Embedding-0.6B-Q8_0 GGUF (595M params, 1024 dim, ~635MB VRAM) preserves 32K context (so late-chunking is unchanged), aligns the encoder tokenizer with the Qwen3-Coder decoder, and cuts parameter count ~6× (VRAM ~12×). It also exercises the substrate-pluggability commitment per `memory-substrate-Spec.md` §3.5.

## 2. Architectural decisions

| Decision | Choice | Rationale |
|---|---|---|
| Dev-default embedder for Qwen3-family agents | **Qwen3-Embedding-0.6B-Q8_0 GGUF** | Tokenizer alignment with decoder; 32K context preserves late-chunking; ~6× fewer parameters than Jina V4 |
| Dev-default for non-Qwen3 (Gemma) agents | nomic-embed-text-v1.5 | Neutral choice; no first-party Gemma embedder exists |
| Serving stack | **llama.cpp via `llama-cpp-sys-2`** (existing dep) | Same C++ engine that serves the decoder; no Python in runtime; in-process per-agent matches `architecture_unified_spu` |
| Candle fork | **Unchanged** | Stays in place for the Jina V4 backend (one valid backend among several under per-profile pluggability); no fork updates needed for this swap |
| Vector index dimension | **Profile-driven, not const** | Current `BELIEF_EMBEDDING_INDEX.dimension = 2048` const must become builder-constructed from agent's spu config |
| Existing dev-DB content | **Rebuild required** | Dimension change 2048 → 1024 invalidates the HNSW index; all current dev agents are throwaway per the brief, so no migration path needed |
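
The "rebuild required" row can be made concrete with a small provisioning-time check. The sketch below is illustrative only: `IndexAction` and `plan_index_action` are hypothetical names, not existing Weaver APIs, and the sketch assumes the stored index dimension can be read back before provisioning.

```rust
/// What provisioning should do with the belief-embedding index.
/// Hypothetical type for illustration; not part of the current codebase.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexAction {
    /// No index exists yet: create one at the configured dimension.
    Create,
    /// Stored and configured dimensions agree: keep the index.
    Reuse,
    /// Dimensions differ (for example 2048 vs 1024): destroy and recreate.
    Rebuild { stored: usize, configured: usize },
}

/// Compare the dimension recorded for an existing index (if any) with the
/// dimension of the agent's configured embedder.
pub fn plan_index_action(stored_dim: Option<usize>, configured_dim: usize) -> IndexAction {
    match stored_dim {
        None => IndexAction::Create,
        Some(stored) if stored == configured_dim => IndexAction::Reuse,
        Some(stored) => IndexAction::Rebuild {
            stored,
            configured: configured_dim,
        },
    }
}
```

An existing Jina V4 dev database would report `Some(2048)` against a configured dim of 1024 and land in the `Rebuild` arm, which matches the destroy-and-recreate policy stated in the table.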

## 3. Reference-count summary

```
jina-v4 / jina_v4 / JinaV4 references: 432 (most legitimate — see §4)
2048 dimension references (in .rs files): 62 (~12 are framework leaks — see §5)
EmbedderClient construction sites: 11 (need llama.cpp sibling — see §6)
Agent YAML files with spu.encoder declarations: 20 (config update — see §7)
Test fixtures asserting Jina specifics: 77 (update or remove — see §5.6)
```

## 4. Legitimate references (stay as-is)

Per the brief: "No hardcoded references to Jina V4 model identifier or 2048 dimension remain **outside** weaver-spu (the embedder implementation layer)."

These directories are the embedder implementation layer — references stay:

| Location | Count | Why it stays |
|---|---|---|
| `crates/weaver-spu/src/encoder/` | 77 | The candle-Jina V4 backend implementation. Continues to exist as one valid backend in the per-profile world. |
| `crates/weaver-spu/src/models/` | 22 | Model definitions for the candle backend. |
| `crates/weaver-spu/src/decoder/` (50/-5) | 45 | Some of these are `multi_model.rs` server-config registrations of Jina V4 as one valid model — NOT decoder-side dependencies on Jina V4. See §5 for the few that are actual leaks. |
| `crates/weaver-spu/src/core/` | 25 | Internal supporting code for the encoder + decoder paths. Includes `probe.rs` which queries embedder dimension dynamically (legitimate). |
| `crates/weaver-spu/tests/` | 50 | Tests of the Jina V4 backend. Backend remains; tests remain. |
| `crates/weaver-spu/src/bin/jina_embed.rs` | (small) | Standalone Jina-specific binary; can stay as a Jina-specific tool. |
| `docs/specs/` | 74 | Documentation referencing Jina V4 as the historical default and one valid backend. Spec text update is a separate documentation PR; not blocking the implementation. |

## 5. Framework leaks (must fix in PR 2)

These are the actual violations of the "no model-specific assumptions outside the embedder implementation layer" principle:

### 5.1 `crates/weaver-database/src/graph/belief.rs:114-120` — **HIGHEST PRIORITY**

```rust
pub const BELIEF_EMBEDDING_INDEX: BeliefEmbeddingIndexDef = BeliefEmbeddingIndexDef {
    name: "embedding_hnsw",
    field: BELIEF_EMBEDDING_FIELD,
    dimension: 2048, // ← hardcoded; framework leak
    metric: "cosine",
    n_lists: 64,
};
```

**Fix:** replace the const with a builder function `belief_embedding_index_for_dim(dim: usize) -> BeliefEmbeddingIndexDef`. Callers in `ensure_belief_indexes` (and equivalents) take the agent's configured dim from the spu config and construct the index def at provisioning time.

Companion changes in this file (lines 47, 66, 99, 287, 402, 454, 539): doc comments and test fixtures referencing "2048-dim Jina V4 embedding" — update text to "embedder-configured dim" and parameterize the test vector lengths.
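
A minimal sketch of the builder, assuming the struct's field types can be inferred from the const shown above. The struct definition and the `BELIEF_EMBEDDING_FIELD` value here are stand-ins for the real ones in `belief.rs`; only the function name and signature come from this audit.

```rust
/// Stand-in for the struct in `belief.rs`; field types are assumed from the
/// existing const, not copied from the real definition.
#[derive(Debug, Clone, PartialEq)]
pub struct BeliefEmbeddingIndexDef {
    pub name: &'static str,
    pub field: &'static str,
    pub dimension: usize,
    pub metric: &'static str,
    pub n_lists: u32,
}

/// Placeholder value; the real constant lives in `belief.rs`.
pub const BELIEF_EMBEDDING_FIELD: &str = "embedding";

/// Build the index definition for whatever dimension the agent's configured
/// embedder reports. Every field except `dimension` keeps its current value.
pub const fn belief_embedding_index_for_dim(dim: usize) -> BeliefEmbeddingIndexDef {
    BeliefEmbeddingIndexDef {
        name: "embedding_hnsw",
        field: BELIEF_EMBEDDING_FIELD,
        dimension: dim,
        metric: "cosine",
        n_lists: 64,
    }
}
```

Keeping the function `const fn` is a design choice rather than a requirement: it would let any remaining compile-time callers keep working during the transition, while provisioning code passes the runtime-resolved dim.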

### 5.2 `crates/weaver-interface/src/harness.rs:207-250`

Test fixtures asserting `model: "jina-v4"`, `dimension: 2048`. These test that the harness initializes correctly with whatever embedder is configured, so they should assert against the runtime's reported model/dim rather than against specific model/dim values.

**Fix:** restructure the assertions to pin "model and dim match the configured embedder" rather than "model is jina-v4 and dim is 2048".
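
One way the restructured assertion could look, as a sketch. `EmbedderInfo` and `check_embedder_matches` are hypothetical helpers introduced for illustration; the real harness types may expose model and dim differently.

```rust
/// Hypothetical summary of an embedder as seen by the harness.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EmbedderInfo {
    pub model_id: String,
    pub dimension: usize,
}

/// Succeeds when the runtime-reported embedder matches the configured one,
/// without naming any specific model or dimension.
pub fn check_embedder_matches(
    configured: &EmbedderInfo,
    reported: &EmbedderInfo,
) -> Result<(), String> {
    if configured.model_id != reported.model_id {
        return Err(format!(
            "model mismatch: configured `{}`, runtime reports `{}`",
            configured.model_id, reported.model_id
        ));
    }
    if configured.dimension != reported.dimension {
        return Err(format!(
            "dimension mismatch: configured {}, runtime reports {}",
            configured.dimension, reported.dimension
        ));
    }
    Ok(())
}
```

A test written this way passes for a Jina V4 profile and a Qwen3-Embedding profile alike, and fails only when the harness wires up something other than what the agent config declares.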

### 5.3 `crates/weaver-interface/src/toml_edit.rs` (2 hits) + `crates/weaver-interface/src/serve.rs` (2 hits)

Config-writing paths that emit `embedder.snapshot` referencing Jina V4 weights. Update default template to point at Qwen3-Embedding-0.6B-Q8_0 GGUF path; keep field substrate-agnostic.

### 5.4 `crates/weaver-database/src/graph/runtime_schema.rs` (3 hits)

Schema-side references to 2048. Same fix pattern as §5.1: parameterize on the runtime-resolved dim.

### 5.5 `crates/weaver-database/src/graph/episode.rs` (2 hits)

Episode-graph schema. Same parameterization pattern.

### 5.6 Test fixtures (high-priority subset)

| File | Action |
|---|---|
| `crates/weaver-database/tests/config_integration.rs` (2 hits) | Update fixtures to use new default; relax `dim == 2048` assertion to "dim matches the configured embedder" |
| `crates/weaver-core/tests/reconciliation_prompt_corpus.rs` (2 hits) | Same treatment as `config_integration.rs` |
| Other test files cited in §3 | Per-file review during PR 2 |

## 6. EmbedderClient construction sites

The current `EmbedderClient` in `weaver-spu/src/encoder/client.rs` is candle-based and Jina V4–specific. PR 2 adds a sibling for the GGUF/llama.cpp path:

| Call site | File | Action |
|---|---|---|
| Daemon load handler | `crates/weaver-interface/src/server.rs:975, +3 more` | Dispatch on agent's spu config: Jina V4 profile → candle path; Qwen3-Embedding profile → llama.cpp path |
| Server entrypoint | `crates/weaver-interface/src/serve.rs:211, +1` | Same dispatch |
| Bench fixtures | `crates/weaver-demo/src/herobench/benchmark.rs:1078, +1` + `crates/weaver-demo/tests/herobench_integration.rs:2512, +1` | Update to use new default; remain Jina-compatible for backward profile testing |
| Trait impl declaration | `crates/weaver-spu/src/encoder/client.rs:131` | Stays — sibling `LlamaCppEmbedderClient` added alongside |

**Approach:** introduce a `LlamaCppEmbedderClient` in `weaver-spu/src/encoder/llama_cpp_client.rs` implementing the same `Embedder` trait. The daemon's load procedure picks which one to construct based on the agent's `spu.encoder.model_id` (or backend hint). Both coexist; per-profile pluggability is the dispatcher.
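
The dispatch described above might be sketched as follows. This is an assumption-laden outline, not the real code: the actual `Embedder` trait's methods are not shown in this audit, so the trait here is a hypothetical stand-in, both clients are stubs that load nothing, and `EncoderConfig` and `load_embedder` are illustrative names. The `model_id` strings are the ones quoted elsewhere in this document.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the real `Embedder` trait.
pub trait Embedder {
    fn backend(&self) -> &'static str;
    fn dimension(&self) -> usize;
    fn weights(&self) -> &Path;
}

/// Stub for the existing candle-based Jina V4 client.
pub struct EmbedderClient {
    snapshot: PathBuf,
}

/// Stub for the new GGUF/llama.cpp sibling.
pub struct LlamaCppEmbedderClient {
    gguf_path: PathBuf,
}

impl Embedder for EmbedderClient {
    fn backend(&self) -> &'static str { "candle" }
    // A real client would read this from the loaded model.
    fn dimension(&self) -> usize { 2048 }
    fn weights(&self) -> &Path { &self.snapshot }
}

impl Embedder for LlamaCppEmbedderClient {
    fn backend(&self) -> &'static str { "llama.cpp" }
    // A real client would query llama.cpp for the embedding size.
    fn dimension(&self) -> usize { 1024 }
    fn weights(&self) -> &Path { &self.gguf_path }
}

/// Hypothetical slice of the agent's `spu.encoder` config.
pub struct EncoderConfig {
    pub model_id: String,
    pub weights: PathBuf,
}

/// Pick the client for the agent's configured encoder; unknown ids are an
/// error rather than a silent fallback to one backend.
pub fn load_embedder(cfg: &EncoderConfig) -> Result<Box<dyn Embedder>, String> {
    match cfg.model_id.as_str() {
        "jina-v4" => Ok(Box::new(EmbedderClient { snapshot: cfg.weights.clone() })),
        "qwen3-embedding-0.6b-gguf" => Ok(Box::new(LlamaCppEmbedderClient {
            gguf_path: cfg.weights.clone(),
        })),
        other => Err(format!("no embedder backend registered for model_id `{other}`")),
    }
}
```

Returning a trait object keeps call sites in `server.rs` and `serve.rs` backend-agnostic; an explicit backend hint field could replace the string match if model ids proliferate.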

## 7. Agent YAML templates

20 YAML files reference `spu.encoder`. For PR 2, the following are operationally relevant:

| File | Action |
|---|---|
| `agents/herobench-benchero-*.yaml` (existing bench cohort) | Leave alone — these are throwaway artifacts from earlier runs. No re-embedding migration. |
| New default template (TBD) | Reference `Qwen3-Embedding-0.6B-Q8_0.gguf` path, `model_id: "qwen3-embedding-0.6b-gguf"`, `context_size: 32768` (native 32K, per the decision in §10). |
| `test_agent_qwen3` (new, for tomorrow's smoke test) | New file using the default template above. |

## 8. Documentation updates (best-effort, separate from blocking work)

| Spec | Action |
|---|---|
| `belief-nodes-embedding-Spec.md` §2.1, §3.5 | Decision log entry; update example dim values; update "Jina V4 → 2048" to "embedder-configured" |
| `memory-substrate-Spec.md` §3.5, §4.2 | Decision log entry |
| `embedder-service.md` | Update default-embedder reference |

These can land in PR 2 or a follow-up doc-only PR — not blocking the implementation.

## 9. Deferred items (logged, not in this PR)

Per the brief: "If something tempting to refactor surfaces during the audit, log it for a follow-up rather than absorbing it into this change."

| Item | Why deferred |
|---|---|
| candle-native Qwen3-Embedding implementation | Out of scope; llama.cpp via `llama-cpp-sys-2` is sufficient and ships fast |
| BERT routing model integration | Out of scope per brief |
| nomic-embed implementation for Gemma agents | Out of scope for PR 2 (test_agent2 is Qwen3 anyway); add when Gemma agent provisioning is needed |
| `weaver-embedding` crate cleanup (legacy gRPC client artifacts) | Cosmetic; can wait |
| Spec text updates beyond decision-log entries | Doc-only follow-up |

## 10. Decisions (locked 2026-05-12 morning)

1. **Context size: 32K** (native). Don't truncate artificially even though our current content fits in 8K — preserving native context will matter later for late-chunking workloads and long-document retrieval. Agent YAML defaults to `context_size: 32768`.

2. **GGUF install path**: `/opt/weaver/models/qwen3-embedding-0.6b/Qwen3-Embedding-0.6B-Q8_0.gguf` — mirrors the existing `/opt/weaver/models/jina-embeddings-v4-*.gguf` convention in `multi_model.rs:1332`. PR 2 includes the operator step of moving the file from the current HF-cache location at `/opt/weaver/huggingface/hub/gguf-models/qwen3-embedding-0.6B/` (download artifact) to the canonical install path (production location).

3. **Encoder/decoder GPU placement**: same-GPU per agent as the default (Plan A — `architecture_unified_spu`), **configurable** via `spu.encoder.gpu` and `spu.decoder.gpu` independent fields. No code-side hard-coding of co-residency; the daemon's load procedure honors whatever the agent YAML declares. Plan B (split-GPU, one agent at a time) is operator-selectable per agent without code changes.
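
Decision 3 could be resolved at load time roughly as below. This is a sketch under stated assumptions: `SpuPlacement` and `resolve_placement` are hypothetical names, GPU ids are modeled as `u32`, and the fallback rules (an unset field follows the other one; both unset means GPU 0) are one plausible reading of "same-GPU as the default", not a confirmed policy.

```rust
/// Resolved GPU placement for one agent's encoder and decoder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SpuPlacement {
    pub encoder_gpu: u32,
    pub decoder_gpu: u32,
}

impl SpuPlacement {
    /// True for Plan A (encoder and decoder share a GPU).
    pub fn is_co_resident(&self) -> bool {
        self.encoder_gpu == self.decoder_gpu
    }
}

/// Honor whatever the agent YAML declares in `spu.encoder.gpu` and
/// `spu.decoder.gpu`; fill gaps so that the default is co-residency.
pub fn resolve_placement(encoder_gpu: Option<u32>, decoder_gpu: Option<u32>) -> SpuPlacement {
    let (encoder_gpu, decoder_gpu) = match (encoder_gpu, decoder_gpu) {
        (Some(e), Some(d)) => (e, d), // both declared: Plan A or Plan B as written
        (Some(e), None) => (e, e),    // assumed: decoder follows encoder
        (None, Some(d)) => (d, d),    // assumed: encoder follows decoder
        (None, None) => (0, 0),       // assumed default device
    };
    SpuPlacement { encoder_gpu, decoder_gpu }
}
```

Because both fields are independent inputs, switching an agent to Plan B only requires declaring two different ids in its YAML; nothing in the resolver enforces co-residency.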

## 11. PR 2 implementation outline

With the decisions in §10 locked, PR 2 ships in this order:

1. **`LlamaCppEmbedderClient`** in `weaver-spu/src/encoder/llama_cpp_client.rs`. Implements `Embedder` trait. Uses `llama-cpp-sys-2`. Same shape as existing `EmbedderClient` but constructed from a GGUF path instead of a candle snapshot.
2. **Backend dispatch** in `server.rs::load_per_agent_embedder`: switch on agent's `spu.encoder.model_id` / backend hint. Constructs the right client.
3. **Schema-side dim parameterization**: replace `BELIEF_EMBEDDING_INDEX` const with builder. Update callers.
4. **Default agent YAML template**: new `agents/_default-qwen3.yaml` (or update existing template).
5. **Test fixture updates**: `config_integration.rs`, `reconciliation_prompt_corpus.rs`, `harness.rs`, others.
6. **Decision log entries** in `belief-nodes-embedding-Spec.md` + `memory-substrate-Spec.md`.
7. **Rebuild doc**: brief note explaining how to migrate an existing dev agent (destroy + recreate; no in-place re-embedding).

## 12. Success criteria (mirrors brief)

- [ ] All currently running tests pass (or are appropriately updated)
- [ ] A new development agent can be provisioned with the new embedder, exercised through a HeroBench session, and produce expected memory writes and retrievals
- [ ] No hardcoded references to Jina V4 model identifier or 2048 dimension remain outside `weaver-spu` (the embedder implementation layer)
- [ ] Documentation reflects the new development-default embedder
- [ ] Decision log entry in `belief-nodes-embedding-Spec.md` noting rationale, scope, and deferred items