Kernel-ML/opensemanticsearch

opensemanticsearch

Vendor-neutral semantic search pipeline toolkit using open-source components.

Companion open-source implementation for the paper:

Designing Vendor-Neutral Semantic Search Pipelines Using Open-Source Embedding Models and FAISS. World Journal of Advanced Engineering Technology and Sciences (WJAETS), 2026, 18(3). DOI: 10.30574/wjaets.2026.18.3.0038

Overview

Many semantic search implementations are tied to proprietary ecosystems: OpenAI embeddings with Pinecone, Cohere with Weaviate, or cloud-specific vector databases. Because vectors produced by different embedding models are not comparable, switching providers means re-embedding the entire corpus and rewriting integration code.

This library provides a swap-and-compare framework for embedding models, FAISS index selection guidance, a complete on-premise pipeline from document ingestion through retrieval and evaluation, and a migration planner for teams moving from proprietary stacks to open-source alternatives.
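The swap-and-compare idea can be illustrated with a minimal sketch. Everything here is hypothetical and is not the library's API: the `Embedder` type, the two toy embedders, and `compare` are invented for illustration. The point is only that when every model sits behind the same callable interface, the same query can be run against each model and the results compared.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: an embedder maps texts to vectors.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def char_count_embedder(texts: Sequence[str]) -> List[List[float]]:
    """Toy embedder: one dimension per letter a-z, holding its count."""
    vectors = []
    for text in texts:
        vec = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                vec[ord(ch) - ord("a")] += 1.0
        vectors.append(vec)
    return vectors


def length_embedder(texts: Sequence[str]) -> List[List[float]]:
    """Toy embedder: [character count, word count]."""
    return [[float(len(t)), float(len(t.split()))] for t in texts]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def top1(embedder: Embedder, query: str, docs: Sequence[str]) -> int:
    """Index of the document closest to the query under this embedder."""
    vectors = embedder([query] + list(docs))
    q, doc_vecs = vectors[0], vectors[1:]
    return max(range(len(docs)), key=lambda i: cosine(q, doc_vecs[i]))


def compare(embedders: Dict[str, Embedder], query: str, docs: Sequence[str]) -> Dict[str, int]:
    """Run the same query through every embedder and collect each top hit."""
    return {name: top1(fn, query, docs) for name, fn in embedders.items()}


docs = ["bank reconciliation", "zzz"]
winners = compare({"chars": char_count_embedder, "length": length_embedder}, "bank", docs)
print(winners)
```

The two toy models disagree on the best document, which is the kind of difference a benchmarking step is meant to surface before committing to a model.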

Modules

Module                         Purpose
opensemanticsearch.embed       Embedding model abstraction with multi-model benchmarking
opensemanticsearch.index       In-memory vector index (cosine/dot) and FAISS index type advisor
opensemanticsearch.retrieve    Semantic search, BM25 keyword scoring, and hybrid search
opensemanticsearch.evaluate    NDCG, Recall, and MRR evaluation over query-relevance pairs
opensemanticsearch.migrate     Migration planner from proprietary to open-source stacks

Installation

pip install opensemanticsearch

Or with uv:

uv add opensemanticsearch

Quick Start

import numpy as np
from opensemanticsearch.index.manager import InMemoryIndexManager
from opensemanticsearch.retrieve.engine import SearchEngine

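# Placeholder data for illustration only: 100 documents with random
# 384-dimensional embeddings. In practice these come from an embedding model.
# Passing NumPy arrays here is an assumption about the expected input types.
rng = np.random.default_rng(0)
doc_ids = [f"doc-{i}" for i in range(100)]
doc_embeddings = rng.normal(size=(100, 384)).astype(np.float32)
query_vector = rng.normal(size=384).astype(np.float32)

# Build an index
index = InMemoryIndexManager(metric="cosine")
index.add(doc_ids, doc_embeddings)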

# Search
engine = SearchEngine(index=index)
results = engine.search(query_vector, top_k=10)
for r in results:
    print(f"rank={r.rank} score={r.score:.3f} id={r.doc_id}")
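For readers new to vector search, the following sketch shows what a brute-force cosine top-k lookup does conceptually. It is a stand-alone illustration in pure Python, not the implementation behind `InMemoryIndexManager`; the function names are invented here.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_top_k(
    doc_ids: Sequence[str],
    doc_vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the best top_k."""
    q = normalize(query)
    scored = []
    for doc_id, vec in zip(doc_ids, doc_vectors):
        d = normalize(vec)
        # For unit vectors, the dot product equals the cosine similarity.
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


hits = cosine_top_k(
    ["a", "b", "c"],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    query=[1.0, 0.1],
    top_k=2,
)
for rank, (doc_id, score) in enumerate(hits, start=1):
    print(f"rank={rank} score={score:.3f} id={doc_id}")
```

Exhaustive scoring like this is exact but scales linearly with corpus size, which is why approximate FAISS index types become relevant for larger corpora.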

Hybrid Search

from opensemanticsearch.retrieve.engine import BM25Scorer, HybridSearchEngine

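# `engine`, `doc_ids`, and `query_vector` come from the Quick Start example.
# `documents` is assumed to be a list of raw text strings aligned with doc_ids.
bm25 = BM25Scorer()
bm25.add_documents(doc_ids, documents)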

hybrid = HybridSearchEngine(
    semantic_engine=engine,
    bm25_scorer=bm25,
    semantic_weight=0.7,
)
results = hybrid.search(query_vector, query_text="bank reconciliation quickbooks", top_k=10)
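As a rough illustration of what hybrid scoring involves, the sketch below computes Okapi BM25 scores and blends them with semantic scores through a weighted sum. The details are assumptions and not confirmed behavior of `HybridSearchEngine`: whitespace tokenization, `k1=1.5` and `b=0.75`, min-max normalization, a keyword weight of `1 - semantic_weight`, and the hard-coded semantic scores are all chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict


def bm25_scores(query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75) -> Dict[str, float]:
    """Okapi BM25 over whitespace tokens; expects a non-empty corpus."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))  # count each term once per document

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (as used by Lucene).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(semantic: Dict[str, float], keyword: Dict[str, float], semantic_weight: float = 0.7) -> Dict[str, float]:
    """Weighted sum of normalized semantic and keyword scores."""
    sem, kw = min_max(semantic), min_max(keyword)
    return {
        doc_id: semantic_weight * sem[doc_id] + (1 - semantic_weight) * kw.get(doc_id, 0.0)
        for doc_id in sem
    }


docs = {
    "d1": "bank reconciliation in quickbooks",
    "d2": "river bank erosion",
    "d3": "quarterly tax filing",
}
semantic = {"d1": 0.82, "d2": 0.40, "d3": 0.55}  # made-up cosine scores
keyword = bm25_scores("bank reconciliation quickbooks", docs)
fused = fuse(semantic, keyword)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

With a semantic weight of 0.7, a document with no keyword overlap but moderate semantic similarity ("d3") can still outrank one that matches a single common term ("d2").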

Index Advisor

from opensemanticsearch.index.advisor import IndexAdvisor, IndexAdvisorConfig

advisor = IndexAdvisor(IndexAdvisorConfig(
    corpus_size=5_000_000,
    embedding_dim=384,
    memory_budget_gb=8,
    latency_target_ms=10,
))
rec = advisor.recommend()
print(rec.summary())
# Recommended index: IVF4096,PQ48
# Memory: 1.86 GB | Latency: 15.0ms | Recall: 92%
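To give a sense of the trade-off behind such a recommendation, here is a back-of-envelope sketch of raw storage for the configuration above. It assumes float32 vectors, 8-bit product-quantization codes, and the FAISS factory-string reading of `IVF4096,PQ48` as 4096 inverted lists with 48 sub-quantizers. It ignores vector IDs and other overhead, and it is not the advisor's own cost model, so these figures are not expected to match the estimate the advisor prints.

```python
def flat_index_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Storage for uncompressed vectors (an exhaustive flat index)."""
    return n_vectors * dim * bytes_per_float


def pq_code_bytes(n_vectors: int, n_subquantizers: int, bits_per_code: int = 8) -> int:
    """Storage for product-quantization codes only."""
    return n_vectors * n_subquantizers * bits_per_code // 8


def ivf_centroid_bytes(n_lists: int, dim: int, bytes_per_float: int = 4) -> int:
    """Storage for the coarse quantizer's centroids."""
    return n_lists * dim * bytes_per_float


GB = 10**9
flat = flat_index_bytes(5_000_000, 384)
codes = pq_code_bytes(5_000_000, 48)
centroids = ivf_centroid_bytes(4096, 384)

print(f"flat float32 vectors: {flat / GB:.2f} GB")
print(f"PQ48 codes:           {codes / GB:.2f} GB")
print(f"IVF4096 centroids:    {centroids / GB:.4f} GB")
```

Uncompressed vectors alone come close to the 8 GB budget, while the compressed codes are roughly 32 times smaller, at the cost of some recall.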

Evaluation

from opensemanticsearch.evaluate.evaluator import RetrievalEvaluator, QueryResult

evaluator = RetrievalEvaluator(k_values=[1, 5, 10])
results = evaluator.evaluate([
    QueryResult("q1", retrieved_ids=["d1", "d2", "d3"], relevant_ids={"d1", "d3"}),
])
print(results.summary())
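The three metrics can be worked through by hand for the query above. This sketch uses the standard binary-relevance definitions and is independent of the library; whether `RetrievalEvaluator` uses exactly these formulas (for example, the same NDCG gain function) is an assumption.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["d1", "d2", "d3"]
relevant = {"d1", "d3"}

for k in (1, 5, 10):
    print(
        f"k={k} recall={recall_at_k(retrieved, relevant, k):.3f} "
        f"ndcg={ndcg_at_k(retrieved, relevant, k):.3f}"
    )
print(f"reciprocal rank={reciprocal_rank(retrieved, relevant):.3f}")
```

MRR is the mean of the reciprocal rank over all queries; with a single query it equals that query's reciprocal rank.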

Migration Planning

from opensemanticsearch.migrate.planner import MigrationPlanner, MigrationPlannerConfig

planner = MigrationPlanner(MigrationPlannerConfig(
    current_provider="openai",
    target_provider="open_source",
    corpus_size=2_000_000,
))
plan = planner.plan()
print(plan.summary())
# Monthly savings: $90 | Quality impact: -4.0%

Development

uv sync --all-extras
uv run pytest tests/ -v --cov=src
uv run isort src/ tests/ && uv run black src/ tests/

Citation

If you use this library in your research, please cite the paper:

Designing Vendor-Neutral Semantic Search Pipelines Using Open-Source
Embedding Models and FAISS.
World Journal of Advanced Engineering Technology and Sciences (WJAETS),
2026, 18(3). DOI: 10.30574/wjaets.2026.18.3.0038

License

Apache 2.0
