Skip to content

fix(mcp): fail loudly when embedder initialization fails#1414

Open
chrisgscott wants to merge 2 commits intogetzep:mainfrom
chrisgscott:fix/embedder-init-hard-failure
Open

fix(mcp): fail loudly when embedder initialization fails#1414
chrisgscott wants to merge 2 commits intogetzep:mainfrom
chrisgscott:fix/embedder-init-hard-failure

Conversation

@chrisgscott
Copy link
Copy Markdown

Summary

  • Embedder initialization failure is now a hard error instead of a warning that silently continues with embedder=None
  • Added search health check at startup that verifies the embedder can produce vectors
  • Added defensive checks in search_nodes and search_memory_facts that return a clear error if embedder is unexpectedly None

Problem

When the embedder client fails to initialize (invalid API key, network issue, etc.), the MCP server logs a warning and continues with embedder_client = None. This causes search_nodes and search_memory_facts to silently return empty results — the hybrid search (NODE_HYBRID_SEARCH_RRF) needs the embedder to generate query vectors for the cosine similarity leg, and without it, RRF reranking produces nothing.

The failure mode is especially confusing because add_memory continues to work normally (it uses the LLM client, not the embedder). Users see data being written successfully but get empty results on every search, with no error indicating why.

Test plan

  • Start the MCP server with a valid embedder config — verify "Search health check passed" in logs
  • Start the MCP server with an invalid/missing API key — verify it fails at startup with a clear error instead of silently degrading
  • Call search_nodes when embedder is available — verify results are returned normally
  • If embedder is somehow None at search time, verify the tool returns an explicit error message instead of empty results

🤖 Generated with Claude Code

chrisgscott and others added 2 commits March 31, 2026 15:47
Race condition: after laptop restart, Claude Code initializes MCP servers
before Docker Desktop finishes starting FalkorDB. Server would throw
RuntimeError and get silently dropped. Now retries 5 times with exponential
backoff (2s, 4s, 8s, 16s, 32s) for connection refused errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously, if the embedder client failed to initialize (e.g., invalid
API key, network issue), the MCP server logged a warning and continued
with embedder=None. This caused search methods (search_nodes,
search_memory_facts) to silently return empty results instead of errors,
since hybrid search requires embeddings to generate query vectors.

The failure mode was especially confusing because writes (add_memory)
continued to work normally — they use the LLM client, not the embedder.
Users saw data being written successfully but could never read it back.

Changes:
- Embedder initialization now raises RuntimeError on failure instead of
  logging a warning and continuing
- Added search health check at startup that verifies the embedder can
  produce vectors end-to-end
- Added defensive checks in search_nodes and search_memory_facts that
  return a clear error if the embedder is unexpectedly None

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@danielchalef
Copy link
Copy Markdown
Member


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Copy link
Copy Markdown

@xkonjin xkonjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good iterative improvements here. Some concerns before merging:

  1. Critical: Silent fallback on embedder health check. In initialize(), if embedder_client.create(['health check']) throws, you catch it and log a warning that "Search may not work correctly." But embedder_client was already created successfully above — this branch means the runtime embedder call failed (network, quota, auth). That is a real failure mode. Logging a warning and continuing means users will see confusing search errors later instead of a clear startup failure. Consider whether this should be a hard error, or at minimum expose a /health endpoint that reports it so operators can detect it.

  2. Model string assumptions. model.startswith('gpt-5') will match any future gpt-5.x model, which is okay today, but it is fragile if OpenAI releases a non-reasoning gpt-5 variant or changes naming. There is no action needed now, but consider centralizing reasoning-model detection into a shared helper so you don't have to update every startswith call later.

  3. Missing tests for token param mapping. _create_completion now switches between max_tokens and max_completion_tokens based on is_reasoning_model. Add a unit test that asserts the correct kwarg is sent for both reasoning and non-reasoning model strings.

  4. Retry loop doubles imports. You import asyncio as _asyncio inside the retry loop body. Move that to the top of the file with the other imports.

  5. Config YAMLs contain hardcoded user. config-second-brain.yaml and config-second-brain-stdio.yaml both set user_id: chris. If these are committed examples, they should either be parameterized (${USER_ID:chris} already exists in the same file) or the hardcoded default should be something neutral like default_user.

Fix the import placement and decide on the embedder health-check failure semantics before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants