Roadmap: ir — agentic IR substrate on ef + vd

## Vision

`ir` is an **information-retrieval substrate for agentic systems**: one uniform "find the relevant things in this corpus" contract that scales from an ad-hoc `find` over an ephemeral list to a search engine over a maintained, living corpus. Retrieval is the core competency; generation, reranking, selection, and citation are layered on top. It is **extensible into a RAG system without being one by default**.

See `misc/docs/ir_04` (architecture & reuse analysis) and the three research docs `ir_01` (capability discovery), `ir_02` (indexing/embedding strategy), `ir_03` (evaluation).

## Reuse stance — compose, do not reinvent

- **`ef`** (Embedding Flow) = the indexing/maintenance/retrieval spine: `content_hash`, `ChangeDetectingCorpus`, segmenters, embedder facade + adapters (`sentence_transformers_embedder`, `HashingEmbedder`), `CachedEmbedder`, artifact-graph (heavy maintenance, deferred), `reranking`, `evaluation`.
- **`vd`** = the vector store + metadata-filter layer: Mongo-style filters (`matches_filter`), hybrid BM25+dense, `reciprocal_rank_fusion`, 16 backends (deferred until scale demands).
- **`dol`** = the key-value persistence layer (repository pattern, `MutableMapping` views).
- **`oa`** = LLM ops for AI-authored surfaces (synopsis / problem-class tags), modeled as cached producers.
- **Sources** already exist: `priv.skills_index` (skills), `projreg`/`hubcap`/`contaix` (packages, docs, GitHub artifacts).
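
For readers unfamiliar with the fusion step named in the `vd` bullet above, here is a minimal sketch of reciprocal rank fusion in plain Python. It is not `vd`'s implementation: the function name `rrf_fuse`, its signature, and the constant `k=60` (a commonly used value) are assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists: each id scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings standing in for a BM25 result list and a dense result list.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "c", "a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.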

## Architecture (module map)

```
ir.config      XDG dirs + defaults + named-corpus registry
ir.base        Artifact, Surface, FilterFields, Record, SearchHit
ir.sources     CorpusSource + Scope/ChangeSignal protocols + smart-default constructors
ir.strategy    IndexingStrategy.decompose(artifact) -> {filter_fields, surfaces}
ir.embed       embedder resolution: 'default'=local MiniLM, 'light'=hashing, +cache
ir.store       repository: dol-backed CorpusStore (meta / vectors / ledger) under XDG
ir.index       pipeline: enumerate -> decompose -> embed -> persist; incremental maintenance
ir.retrieve    hard metadata filter + dense brute-force + artifact dedupe (hybrid/rerank seams)
ir.select      selection stage (distractor-robust commit) + progressive disclosure [later]
ir.eval        capability-discovery eval harness (ir_03) [later]
ir.cli         argh CLI + facade
```
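
The "incremental maintenance" part of `ir.index` can be pictured as a diff between the current scope and a ledger of already-indexed versions. The sketch below is an assumption about the shape of that step, not the shipped code: `hash_text` and `plan_update` are hypothetical names, and the ledger here maps an artifact id to a version only (the real ledger also tracks surface ids).

```python
import hashlib


def hash_text(raw: str) -> str:
    """Hypothetical change signal: the version is the SHA-256 of the raw text."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def plan_update(scope: dict, ledger: dict) -> dict:
    """Compare current artifacts against the versions recorded at last indexing."""
    current = {key: hash_text(raw) for key, raw in scope.items()}
    return {
        "added": sorted(k for k in current if k not in ledger),
        "changed": sorted(k for k in current if k in ledger and ledger[k] != current[k]),
        "removed": sorted(k for k in ledger if k not in current),
    }


# Ledger as left by a previous indexing run (toy data).
ledger = {
    "a.md": hash_text("alpha"),
    "b.md": hash_text("beta"),
    "d.md": hash_text("delta"),
}
# Current state of the corpus: b.md edited, c.md new, d.md deleted.
scope = {"a.md": "alpha", "b.md": "beta v2", "c.md": "gamma"}
plan = plan_update(scope, ledger)
```

Only the `added` and `changed` artifacts would need to be decomposed and embedded again; `removed` ones would have their records dropped.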

## Core abstraction — defining a corpus source

A `CorpusSource` is **an abstract strategy + parameters**, with smart defaults:

```
CorpusSource(
  name,
  scope,                 # MutableMapping[id -> raw] (folder / dict / dol store / callable)
  indexing_strategy = WholeText(),   # raw -> {filter_fields, surfaces}; smart default
  change_signal     = ContentHash(), # raw -> version; default content hash
  embedder          = 'default',     # 'default'=MiniLM(local), 'light'=hashing
  store             = <XDG dol store under share/ir/<name>>,
)
```
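
As a rough illustration of "strategy + parameters with smart defaults", the signature above could be modeled as a dataclass. This is a sketch under assumptions, not the actual `ir` class: `CorpusSourceSketch`, `whole_text`, and `hash_version` are hypothetical stand-ins for `CorpusSource`, `WholeText()`, and `ContentHash()`, and the `store` parameter is left out.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Mapping


def whole_text(raw: Any) -> dict:
    """Stand-in for WholeText(): no filter fields, one surface with the full text."""
    return {"filter_fields": {}, "surfaces": {"text": str(raw)}}


def hash_version(raw: Any) -> str:
    """Stand-in for ContentHash(): the version is a hash of the raw content."""
    return hashlib.sha256(str(raw).encode("utf-8")).hexdigest()


@dataclass
class CorpusSourceSketch:
    name: str
    scope: Mapping[str, Any]  # id -> raw artifact
    indexing_strategy: Callable[[Any], dict] = whole_text
    change_signal: Callable[[Any], str] = hash_version
    embedder: str = "default"


# Only name and scope are required; everything else falls back to a default.
source = CorpusSourceSketch(name="notes", scope={"n1": "hello world"})
decomposed = source.indexing_strategy(source.scope["n1"])
version = source.change_signal(source.scope["n1"])
```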

The various useful ways to define a source are provided as constructors: `from_files`, `from_mapping`, `from_skills`, `from_packages`, `from_md_reports`.

`IndexingStrategy` is the **"what do we index?" seam** (the genuinely-new core, per ir_04 §4): one artifact decomposes into **filter_fields** (hard-filter metadata: name, ownership, tags) and **surfaces** (heterogeneous embeddable units: description / AI synopsis / problem-classes / chunks). Defaults: `WholeText`, `Chunked`, `Skill`, `Package`.
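
To make the filter_fields / surfaces split concrete, here is a hedged sketch of what a decomposition for a skill-like artifact might look like. The field names (`name`, `tags`, `description`, `body`), the `chunk:<i>` surface keys, and the character-window chunking rule are all assumptions for illustration; the shipped `Skill` and `Chunked` strategies may differ.

```python
def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list:
    """Fixed-size character windows with overlap (hypothetical chunking rule)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def decompose_skill(artifact: dict) -> dict:
    """Split one artifact into hard-filter metadata and embeddable surfaces."""
    filter_fields = {"name": artifact["name"], "tags": artifact.get("tags", [])}
    surfaces = {"description": artifact["description"]}
    for i, chunk in enumerate(chunk_text(artifact["body"])):
        surfaces[f"chunk:{i}"] = chunk
    return {"filter_fields": filter_fields, "surfaces": surfaces}


# Toy artifact: a 100-character body yields three overlapping chunks.
artifact = {
    "name": "pdf-tools",
    "tags": ["pdf"],
    "description": "Extract text and tables from PDF files.",
    "body": "x" * 100,
}
result = decompose_skill(artifact)
```

Each surface is embedded separately, while the filter fields are never embedded and are only used for exact-match filtering.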

## Data & persistence organization (repository pattern via dol + XDG)

- `~/.config/ir/` — configs + named-corpus registry.
- `~/.local/share/ir/<corpus>/` — durable: `meta/` (record metadata, JSON), `vectors/` (numpy), `ledger.json` (artifact -> version + surface ids).
- `~/.cache/ir/embeddings/<model>/` — embedding cache keyed by (model, content_hash); regenerable.
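
The layout above can be resolved with a few lines of standard-library code. The sketch below assumes that the `XDG_*_HOME` environment variables are honored when set (as the XDG Base Directory specification describes); whether `ir.config` does exactly this is an assumption, and `xdg_dir` and `corpus_paths` are hypothetical helper names.

```python
import os
from pathlib import Path


def xdg_dir(env_var: str, fallback: str) -> Path:
    """Use the XDG environment variable when set, else the default under $HOME."""
    value = os.environ.get(env_var)
    return Path(value) if value else Path.home() / fallback


def corpus_paths(corpus: str) -> dict:
    """Map the roles listed above to concrete paths for one named corpus."""
    data = xdg_dir("XDG_DATA_HOME", ".local/share") / "ir" / corpus
    return {
        "config": xdg_dir("XDG_CONFIG_HOME", ".config") / "ir",
        "meta": data / "meta",
        "vectors": data / "vectors",
        "ledger": data / "ledger.json",
        "embedding_cache": xdg_dir("XDG_CACHE_HOME", ".cache") / "ir" / "embeddings",
    }


paths = corpus_paths("skills")
```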

All key-value views are `dol` `MutableMapping`s → swap persistence by swapping the store. Default store = local files; brute-force search loads vectors into a numpy matrix.
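
The "swap persistence by swapping the store" idea can be shown without `dol` itself. The class below is a hypothetical stand-in written against `collections.abc.MutableMapping`, not `dol`'s API; it assumes keys are safe to use as file names.

```python
import json
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


class JsonDirStore(MutableMapping):
    """One JSON file per key in a directory (illustrative, not a dol class)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key):
        return self.root / f"{key}.json"

    def __getitem__(self, key):
        try:
            return json.loads(self._path(key).read_text(encoding="utf-8"))
        except FileNotFoundError:
            raise KeyError(key) from None

    def __setitem__(self, key, value):
        self._path(key).write_text(json.dumps(value), encoding="utf-8")

    def __delitem__(self, key):
        try:
            self._path(key).unlink()
        except FileNotFoundError:
            raise KeyError(key) from None

    def __iter__(self):
        return (p.stem for p in self.root.glob("*.json"))

    def __len__(self):
        return sum(1 for _ in self.root.glob("*.json"))


def record_meta(store, artifact_id, meta):
    """Persistence-agnostic write and read-back: works with any MutableMapping."""
    store[artifact_id] = meta
    return store[artifact_id]


# The same calling code runs against an in-memory dict and a file-backed store.
in_memory = {}
on_disk = JsonDirStore(tempfile.mkdtemp())
same = record_meta(in_memory, "a1", {"tags": ["x"]}) == record_meta(on_disk, "a1", {"tags": ["x"]})
```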

## Sizing → light by default

Target corpora are small: **skills ≈ 157**, **packages ≈ 231**, **md-reports ≈ 98 files (~1.4 MB)**. Brute-force cosine is exact and instant at this scale — no vector DB needed. The `vd`-backed / artifact-graph paths are the documented upgrade for when a corpus outgrows brute force.
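
At a few hundred records, retrieval can be a plain loop. The sketch below shows the three steps named for `ir.retrieve` (hard filter, brute-force cosine, artifact dedupe) in dependency-free Python; it is an illustration under assumptions, not the shipped code. The record layout and the `required_tag` parameter are hypothetical, and a real implementation would more likely score all vectors at once against a numpy matrix.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, records, top_k=3, required_tag=None):
    """records: dicts with 'artifact', 'tags', 'vector' (one record per surface)."""
    best = {}
    for rec in records:
        if required_tag is not None and required_tag not in rec["tags"]:
            continue  # hard metadata filter runs before any scoring
        score = cosine(query_vec, rec["vector"])
        # Dedupe: keep only the best-scoring surface of each artifact.
        if score > best.get(rec["artifact"], -2.0):
            best[rec["artifact"]] = score
    return sorted(best.items(), key=lambda kv: -kv[1])[:top_k]


# Toy records: artifact "a" has two surfaces, "c" is excluded by the tag filter.
records = [
    {"artifact": "a", "tags": ["py"], "vector": [1.0, 0.0]},
    {"artifact": "a", "tags": ["py"], "vector": [0.6, 0.8]},
    {"artifact": "b", "tags": ["py"], "vector": [0.0, 1.0]},
    {"artifact": "c", "tags": ["js"], "vector": [1.0, 0.0]},
]
hits = search([1.0, 0.0], records, required_tag="py")
```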

## Embedding policy

- **Default = decent local**: `all-MiniLM-L6-v2` (384-dim) via `ef.embedder_adapters.sentence_transformers_embedder`, wrapped in `CachedEmbedder`. Requires `USE_TF=0` (avoids a TensorFlow/numpy ABI crash on import).
- **Light = hashing**: `ef.HashingEmbedder` (numpy-only) for fast tests where semantic power is not what's under test.
- Graceful fallback to hashing (with a warning) if sentence-transformers is unavailable.
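
The policy above can be sketched as a small resolver. This is an illustration under assumptions rather than the code in `ir.embed`: `hashing_embed` is a hypothetical feature-hashing embedder (not `ef.HashingEmbedder`), `resolve_embedder` is a made-up name, and the default branch calls `sentence_transformers` directly instead of going through the `ef` adapter and `CachedEmbedder`.

```python
import hashlib
import math
import os
import warnings


def hashing_embed(text: str, dim: int = 64) -> list:
    """Feature-hash whitespace tokens into a fixed-size, L2-normalized vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def resolve_embedder(kind: str = "default"):
    """Return a text -> vector callable, degrading to hashing with a warning."""
    if kind == "default":
        try:
            # Set before the import, per the USE_TF=0 requirement noted above.
            os.environ.setdefault("USE_TF", "0")
            from sentence_transformers import SentenceTransformer
        except ImportError:
            warnings.warn("sentence-transformers unavailable; using hashing embedder")
            return hashing_embed
        model = SentenceTransformer("all-MiniLM-L6-v2")
        return lambda text: model.encode(text).tolist()
    return hashing_embed


# 'light' never touches sentence-transformers, which keeps tests fast.
embed = resolve_embedder("light")
vector = embed("find relevant skills")
```

The hashing vectors are deterministic and cheap but carry no semantics beyond token overlap, which is why they suit plumbing tests rather than quality evaluation.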

## Plan / tracking

Component issues (each carries its own decision log as comments):
- [x] Foundation: config/XDG + core types
- [x] Persistence: dol-backed CorpusStore (meta/vectors/ledger)
- [x] Embedding resolution (local default + light + cache)
- [x] CorpusSource + Scope/ChangeSignal/IndexingStrategy + smart-default constructors
- [x] Indexing pipeline + incremental maintenance
- [x] Retrieval (dense brute-force + hard filter + dedupe; hybrid/rerank seams)
- [x] Use case: md-reports corpus + test
- [x] Use case: skills corpus + test
- [x] Use case: packages corpus (multi-surface) + test
- [x] Selection stage + capability-discovery surface (ir_01) [later]
- [x] Eval harness (ir_03) [later]
- [x] CLI + facade + named-corpus registry

Decisions are logged as comments on the relevant issue as work lands.



---

**2026-06-12: checklist reconciled.** All v1 items above shipped across ir 0.1.x; their boxes had been left unchecked by oversight. Next capability arc, per ADR #43 (*one linked-artifact substrate, three operators*):

- [ ] Context expansion: #44 (surface_index + sibling addressing) → #45 (`expand(hit)` + neighborhood policies)
- [ ] Linked retrieval: #46 (typed-edge `links` view + GraphStore protocol) → #47 (`traverse()` + WalkPolicy, collapsed-tree first) → #48 (synopsis surfaces)
- [ ] Purpose-centric memory lives in the agent layer: thorwhalen/raglab#6

Flat top-k + rerank remains the default; traversal policies must beat it on eval before promotion (#43).
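
One way to read the promotion rule is as a simple gate over an eval set. The sketch below is an assumption about how such a gate could look: the metric (mean recall@k), the function names, and the toy data are all hypothetical, and the actual criteria are whatever #43 and the eval harness define.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top = set(ranked_ids[:k])
    return sum(1 for r in relevant_ids if r in top) / len(relevant_ids)


def should_promote(baseline_runs, candidate_runs, judgments, k=5):
    """Promote only if the candidate's mean recall@k strictly beats the baseline's."""
    def mean_recall(runs):
        return sum(recall_at_k(runs[q], judgments[q], k) for q in judgments) / len(judgments)
    return mean_recall(candidate_runs) > mean_recall(baseline_runs)


# Toy relevance judgments and ranked results per query (illustrative only).
judgments = {"q1": {"a", "b"}, "q2": {"c"}}
flat_top_k = {"q1": ["a", "x", "y"], "q2": ["z", "c", "w"]}
traversal = {"q1": ["a", "b", "x"], "q2": ["c", "z", "w"]}
promote = should_promote(flat_top_k, traversal, judgments, k=2)
```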

