Roadmap: ir — agentic IR substrate on ef + vd #1

@thorwhalen

Vision

ir is an information-retrieval substrate for agentic systems: one uniform "find the relevant things in this corpus" contract that scales from an ad-hoc find over an ephemeral list to a search engine over a maintained, living corpus. Retrieval is the core competency; generation/reranking/selection/citation are layered on top. It is extensible into RAG without being one by default.

See misc/docs/ir_04 (architecture & reuse analysis) and the three research docs ir_01 (capability discovery), ir_02 (indexing/embedding strategy), ir_03 (evaluation).

Reuse stance — compose, do not reinvent

  • ef (Embedding Flow) = the indexing/maintenance/retrieval spine: content_hash, ChangeDetectingCorpus, segmenters, embedder facade + adapters (sentence_transformers_embedder, HashingEmbedder), CachedEmbedder, artifact-graph (heavy maintenance, deferred), reranking, evaluation.
  • vd = the vector store + metadata-filter layer: Mongo-style filters (matches_filter), hybrid BM25+dense, reciprocal_rank_fusion, 16 backends (deferred until scale demands).
  • dol = the key-value persistence layer (repository pattern, MutableMapping views).
  • oa = LLM ops for AI-authored surfaces (synopsis / problem-class tags), modeled as cached producers.
  • Sources already exist: priv.skills_index (skills), projreg/hubcap/contaix (packages, docs, GitHub artifacts).
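For readers unfamiliar with the fusion step named in the vd bullet, here is a minimal sketch of reciprocal rank fusion as a general technique. It is not vd's implementation or signature: the function name `rrf`, its parameters, and the constant `k=60` (a commonly used default for this technique) are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked id lists: each list adds 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["a", "b", "c"]   # hypothetical lexical ranking
dense_ranking = ["b", "c", "a"]  # hypothetical dense ranking
fused = rrf([bm25_ranking, dense_ranking])
```

Because only ranks are used, the lexical and dense scores never need to be put on a common scale, which is what makes this a convenient seam for a later hybrid path.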

Architecture (module map)

ir.config      XDG dirs + defaults + named-corpus registry
ir.base        Artifact, Surface, FilterFields, Record, SearchHit
ir.sources     CorpusSource + Scope/ChangeSignal protocols + smart-default constructors
ir.strategy    IndexingStrategy.decompose(artifact) -> {filter_fields, surfaces}
ir.embed       embedder resolution: 'default'=local MiniLM, 'light'=hashing, +cache
ir.store       repository: dol-backed CorpusStore (meta / vectors / ledger) under XDG
ir.index       pipeline: enumerate -> decompose -> embed -> persist; incremental maintenance
ir.retrieve    hard metadata filter + dense brute-force + artifact dedupe (hybrid/rerank seams)
ir.select      selection stage (distractor-robust commit) + progressive disclosure [later]
ir.eval        capability-discovery eval harness (ir_03) [later]
ir.cli         argh CLI + facade
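The ir.index line above (enumerate -> decompose -> embed -> persist, with incremental maintenance) could look roughly like the sketch below. It assumes plain dicts stand in for the dol-backed views, that raw artifacts are strings, and that a ledger entry holds `version` and `surface_ids`; the function names and the `artifact_id#surface` id scheme are hypothetical, not ir's API.

```python
from __future__ import annotations

import hashlib
from typing import Callable, Mapping, MutableMapping


def content_hash(raw: str) -> str:
    """Stand-in change signal: the version of an artifact is a hash of its text."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def index_corpus(
    scope: Mapping[str, str],
    decompose: Callable[[str], dict],
    embed: Callable[[list[str]], list[list[float]]],
    vectors: MutableMapping[str, list[float]],
    ledger: MutableMapping[str, dict],
) -> dict:
    """Enumerate -> decompose -> embed -> persist, skipping unchanged artifacts."""
    stats = {"indexed": 0, "skipped": 0, "removed": 0}
    for artifact_id, raw in scope.items():
        version = content_hash(raw)
        entry = ledger.get(artifact_id)
        if entry is not None and entry["version"] == version:
            stats["skipped"] += 1
            continue
        # Drop stale surface vectors before re-indexing a changed artifact.
        for surface_id in (entry or {}).get("surface_ids", []):
            vectors.pop(surface_id, None)
        surfaces = decompose(raw)["surfaces"]  # assumed shape: {surface_name: text}
        names = list(surfaces)
        embedded = embed([surfaces[n] for n in names])
        surface_ids = [f"{artifact_id}#{n}" for n in names]
        for surface_id, vec in zip(surface_ids, embedded):
            vectors[surface_id] = vec
        ledger[artifact_id] = {"version": version, "surface_ids": surface_ids}
        stats["indexed"] += 1
    # Artifacts that vanished from the scope are removed from the index.
    for artifact_id in [a for a in ledger if a not in scope]:
        for surface_id in ledger.pop(artifact_id)["surface_ids"]:
            vectors.pop(surface_id, None)
        stats["removed"] += 1
    return stats
```

Running it twice over an unchanged scope should report every artifact as skipped, which is the property that keeps re-indexing cheap.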

Core abstraction — defining a corpus source

A CorpusSource is an abstract strategy + parameters, with smart defaults:

CorpusSource(
  name,
  scope,                 # MutableMapping[id -> raw] (folder / dict / dol store / callable)
  indexing_strategy = WholeText(),   # raw -> {filter_fields, surfaces}; smart default
  change_signal     = ContentHash(), # raw -> version; default content hash
  embedder          = 'default',     # 'default'=MiniLM(local), 'light'=hashing
  store             = <XDG dol store under share/ir/<name>>,
)

"Various useful ways to define a source" = constructors: from_files, from_mapping, from_skills, from_packages, from_md_reports.
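As one way to read the signature and constructors above, here is a minimal sketch of CorpusSource as a dataclass with smart defaults. The field names follow the signature in this issue; everything else is an assumption for illustration: strategies are modeled as plain functions, `store=None` stands for "use the default XDG store", and the `pattern` parameter of `from_files` is hypothetical.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Mapping


def whole_text(raw: str) -> dict:
    """Assumed default strategy: no filter fields, one surface with the whole text."""
    return {"filter_fields": {}, "surfaces": {"text": raw}}


def content_hash(raw: str) -> str:
    """Assumed default change signal: raw -> version via a content hash."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


@dataclass
class CorpusSource:
    name: str
    scope: Mapping[str, str]                              # id -> raw
    indexing_strategy: Callable[[str], dict] = whole_text
    change_signal: Callable[[str], str] = content_hash
    embedder: str = "default"                             # or "light"
    store: Any = None                                     # None = default store (assumption)

    @classmethod
    def from_mapping(cls, name: str, mapping: Mapping[str, str], **overrides) -> "CorpusSource":
        return cls(name=name, scope=dict(mapping), **overrides)

    @classmethod
    def from_files(cls, name: str, root: str, pattern: str = "*.md", **overrides) -> "CorpusSource":
        root_path = Path(root)
        # Key each file by its path relative to the root folder.
        scope = {
            p.relative_to(root_path).as_posix(): p.read_text(encoding="utf-8")
            for p in sorted(root_path.rglob(pattern))
            if p.is_file()
        }
        return cls(name=name, scope=scope, **overrides)
```

With this shape, an ad-hoc find over an ephemeral list needs only `CorpusSource.from_mapping("tmp", items)`, while a maintained corpus overrides individual fields.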

IndexingStrategy is the "what do we index?" seam (the genuinely-new core, per ir_04 §4): one artifact decomposes into filter_fields (hard-filter metadata: name, ownership, tags) and surfaces (heterogeneous embeddable units: description / AI synopsis / problem-classes / chunks). Defaults: WholeText, Chunked, Skill, Package.
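The decomposition described above can be illustrated with a small sketch. It assumes an artifact is a dict with hypothetical keys (`text`, `name`, `tags`, `description`, `synopsis`) and that chunking is by words with overlap; the class bodies are illustrative guesses at what defaults such as Chunked and Package might do, not ir's actual implementations.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class IndexingStrategy(Protocol):
    def decompose(self, artifact: dict) -> dict:
        """Return {'filter_fields': {...}, 'surfaces': {surface_name: text}}."""
        ...


@dataclass
class Chunked:
    size: int = 200     # words per chunk (assumed unit)
    overlap: int = 50   # words shared by consecutive chunks

    def __post_init__(self) -> None:
        if not 0 <= self.overlap < self.size:
            raise ValueError("overlap must satisfy 0 <= overlap < size")

    def decompose(self, artifact: dict) -> dict:
        words = artifact["text"].split()
        surfaces: dict[str, str] = {}
        start = 0
        while True:
            surfaces[f"chunk_{len(surfaces)}"] = " ".join(words[start:start + self.size])
            if start + self.size >= len(words):
                break  # the current chunk already reaches the end of the text
            start += self.size - self.overlap
        return {"filter_fields": {}, "surfaces": surfaces}


@dataclass
class Package:
    def decompose(self, artifact: dict) -> dict:
        # Hard-filter metadata is kept apart from the embeddable surfaces.
        filter_fields = {"name": artifact["name"], "tags": list(artifact.get("tags", []))}
        surfaces = {"description": artifact["description"]}
        if artifact.get("synopsis"):
            surfaces["synopsis"] = artifact["synopsis"]  # e.g. an AI-authored surface
        return {"filter_fields": filter_fields, "surfaces": surfaces}
```

The point of the split is that `filter_fields` never reach the embedder, while each entry in `surfaces` becomes its own vector that still resolves back to one artifact.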

Data & persistence organization (repository pattern via dol + XDG)

  • ~/.config/ir/ — configs + named-corpus registry.
  • ~/.local/share/ir/<corpus>/ — durable: meta/ (record metadata, JSON), vectors/ (numpy), ledger.json (artifact -> version + surface ids).
  • ~/.cache/ir/embeddings/<model>/ — embedding cache keyed by (model, content_hash); regenerable.

All key-value views are dol MutableMappings → swap persistence by swapping the store. Default store = local files; brute-force search loads vectors into a numpy matrix.
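The directory layout above could be resolved along these lines. This is a sketch using only the standard library: the helper names are hypothetical, and the fallbacks are the defaults given by the XDG Base Directory specification, which also says relative values in these variables should be ignored.

```python
from __future__ import annotations

import os
from pathlib import Path


def _xdg_dir(env_var: str, fallback: str) -> Path:
    value = os.environ.get(env_var, "")
    # Empty or relative values are treated as unset, per the XDG spec.
    if value and os.path.isabs(value):
        return Path(value)
    return Path.home() / fallback


def config_dir() -> Path:
    return _xdg_dir("XDG_CONFIG_HOME", ".config") / "ir"


def corpus_dir(corpus: str) -> Path:
    return _xdg_dir("XDG_DATA_HOME", ".local/share") / "ir" / corpus


def embedding_cache_dir(model: str) -> Path:
    return _xdg_dir("XDG_CACHE_HOME", ".cache") / "ir" / "embeddings" / model


def corpus_layout(corpus: str) -> dict[str, Path]:
    """Durable per-corpus locations: meta/, vectors/ and ledger.json."""
    root = corpus_dir(corpus)
    return {"meta": root / "meta", "vectors": root / "vectors", "ledger": root / "ledger.json"}
```

Keeping the cache under its own root reflects the note that it is regenerable: deleting it costs re-embedding time but loses no corpus state.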

Sizing → light by default

Target corpora are small: skills ≈ 157, packages ≈ 231, md-reports ≈ 98 files (~1.4 MB). At this scale brute-force cosine is exact and effectively instant (scoring a query is one small matrix-vector product over the loaded vectors), so no vector DB is needed. The vd-backed / artifact-graph paths are the documented upgrade for when a corpus outgrows brute force.
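To make the brute-force path concrete, here is a minimal sketch of the retrieval steps named in the module map (hard metadata filter, exact dense cosine, artifact dedupe) over a numpy matrix. The record shape (`artifact_id`, `surface`, `filter_fields`), the equality-only `where` filter, and the function name `search` are assumptions for illustration; the Mongo-style filters mentioned for vd are richer than this.

```python
from __future__ import annotations

import numpy as np


def search(query_vec, matrix, records, *, where=None, k=5):
    """Hard filter -> exact cosine over the remaining rows -> one hit per artifact."""
    where = where or {}
    keep = [
        i for i, rec in enumerate(records)
        if all(rec["filter_fields"].get(f) == v for f, v in where.items())
    ]
    if not keep:
        return []
    sub = matrix[keep]
    # Normalise rows and query so the dot product equals cosine similarity.
    sub = sub / np.clip(np.linalg.norm(sub, axis=1, keepdims=True), 1e-12, None)
    q = query_vec / max(float(np.linalg.norm(query_vec)), 1e-12)
    scores = sub @ q
    hits, seen = [], set()
    for j in np.argsort(-scores):
        rec = records[keep[j]]
        if rec["artifact_id"] in seen:
            continue  # dedupe: keep only the best-scoring surface per artifact
        seen.add(rec["artifact_id"])
        hits.append((rec["artifact_id"], rec["surface"], float(scores[j])))
        if len(hits) == k:
            break
    return hits
```

Every row is scored, so the ranking is exact rather than approximate; the cost grows linearly with the number of surfaces, which is the point at which the vd-backed path becomes relevant.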

Embedding policy

  • Default = decent local: all-MiniLM-L6-v2 (384-dim) via ef.embedder_adapters.sentence_transformers_embedder, wrapped in CachedEmbedder. Requires USE_TF=0 (avoids a TensorFlow/numpy ABI crash on import).
  • Light = hashing: ef.HashingEmbedder (numpy-only) for fast tests where semantic power is not what's under test.
  • Graceful fallback to hashing (with a warning) if sentence-transformers is unavailable.
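The policy above (local default, hashing for light use, cache, graceful fallback) might be wired together as in the following sketch. It does not use ef's classes: `hashing_embedder`, `cached`, `resolve_embedder`, the `st_available` parameter and the 256-dim size are hypothetical stand-ins, written with the standard library only. The one real third-party call is `SentenceTransformer(...).encode(...)`, which is reached only when sentence-transformers is installed.

```python
from __future__ import annotations

import hashlib
import importlib.util
import math
import os
import warnings
from typing import Callable

Embedder = Callable[[list[str]], list[list[float]]]


def hashing_embedder(dim: int = 256) -> Embedder:
    """Feature-hashed bag of words: deterministic, no model download."""
    def embed(texts: list[str]) -> list[list[float]]:
        out = []
        for text in texts:
            vec = [0.0] * dim
            for token in text.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).digest()
                vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            out.append([x / norm for x in vec])
        return out
    return embed


def minilm_embedder() -> Embedder:
    os.environ.setdefault("USE_TF", "0")  # set before the import, per the policy above
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return lambda texts: model.encode(list(texts)).tolist()


def cached(embed: Embedder, model_name: str, cache: dict) -> Embedder:
    """Cache vectors under (model, content_hash) so repeated texts are embedded once."""
    def wrapper(texts: list[str]) -> list[list[float]]:
        keys = [(model_name, hashlib.sha256(t.encode("utf-8")).hexdigest()) for t in texts]
        missing = {key: t for key, t in zip(keys, texts) if key not in cache}
        if missing:
            for key, vec in zip(missing, embed(list(missing.values()))):
                cache[key] = vec
        return [cache[key] for key in keys]
    return wrapper


def resolve_embedder(name: str = "default", *, cache=None, st_available=None) -> Embedder:
    if st_available is None:
        st_available = importlib.util.find_spec("sentence_transformers") is not None
    if name == "default" and not st_available:
        warnings.warn("sentence-transformers unavailable; falling back to hashing embedder")
        name = "light"
    if name == "light":
        embed, model_name = hashing_embedder(), "hashing-256"
    elif name == "default":
        embed, model_name = minilm_embedder(), "all-MiniLM-L6-v2"
    else:
        raise ValueError(f"unknown embedder: {name!r}")
    return cached(embed, model_name, {} if cache is None else cache)
```

Including the model name in the cache key matters for the fallback case: vectors produced by the hashing embedder can never be served in place of MiniLM vectors once the real model becomes available.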

Plan / tracking

Component issues (each carries its own decision log as comments):

  • Foundation: config/XDG + core types
  • Persistence: dol-backed CorpusStore (meta/vectors/ledger)
  • Embedding resolution (local default + light + cache)
  • CorpusSource + Scope/ChangeSignal/IndexingStrategy + smart-default constructors
  • Indexing pipeline + incremental maintenance
  • Retrieval (dense brute-force + hard filter + dedupe; hybrid/rerank seams)
  • Use case: md-reports corpus + test
  • Use case: skills corpus + test
  • Use case: packages corpus (multi-surface) + test
  • Selection stage + capability-discovery surface (ir_01) [later]
  • Eval harness (ir_03) [later]
  • CLI + facade + named-corpus registry

Decisions are logged as comments on the relevant issue as work lands.


2026-06-12 — checklist reconciled (all v1 items above shipped across ir 0.1.x; left unchecked by oversight). Next capability arc, per ADR #43 (one linked-artifact substrate, three operators):

Flat top-k + rerank remains the default; traversal policies must beat it on eval before promotion (#43).
