fix: reranked websearch content doesn't overload the model anymore #63

Open

dan-and wants to merge 4 commits into danny-avila:main from dan-and:main

Conversation

@dan-and (Contributor) commented Mar 1, 2026

Hi Danny,
I was trying to figure out why my local web search, web crawler, and reranker always killed my model context, leaving completely lobotomized models as the context/KV cache got overwritten. That was already six months ago (see https://www.reddit.com/r/LocalLLaMA/comments/1mucj1p/which_models_are_suitable_for_websearch/).

I had to dig deep to figure out what is going on in the interplay between LibreChat, the agents package, and the flow across SearXNG, Firecrawl, the Jina reranker, and local models, which usually have a limited amount of KV cache.

I spent the last weeks adding more and more debug code to the agents code and found the following:

First fix: expandHighlights sending whole page content

With default expandHighlights parameters (mainExpandBy=300, separatorExpandBy=150), every web search query overflowed the context window:

[formatResultsForLLM] output size: 3,204,513 chars, ~801,128 tokens
-> empty_messages: Message pruning removed all messages as none fit in the context window
[formatResultsForLLM] output size: 2,174,069 chars, ~543,517 tokens
-> empty_messages: Message pruning removed all messages as none fit in the context window

Per-source JSON size inspection confirmed sources were carrying full page content: individual source sizes in the hundreds of thousands of characters.

expandHighlights() had an early-return path for sources with no highlights that returned the original result object unmodified, including the full raw scraped page content from Firecrawl. This caused the entire scraped content of those pages to pass through into formatResultsForLLM and be sent to the LLM as part of the tool output, producing context window overflows on every web search query.

What I found:

In src/tools/search/highlights.ts, the expandHighlights() function maps over organic results and top stories to expand highlight boundaries. For sources that had no highlights (either because the reranker returned nothing, the source failed to scrape, or content was empty), the function returned early:

// Before
if (
  result.content == null ||
  result.content === '' ||
  !result.highlights ||
  result.highlights.length === 0
) {
  return result; // ← returned the original object, full content included
}

The normal code path (when highlights exist) correctly creates a shallow copy and deletes content:

const resultCopy = { ...result };
// ... process highlights ...
delete resultCopy.content;
return resultCopy;

But the early-return path bypassed this entirely, passing result.content (the full scraped markdown from Firecrawl, potentially hundreds of kilobytes per page) directly into the output.

The Change

Strip content from the result in the early-return path before returning:

// After
if (
  result.content == null ||
  result.content === '' ||
  !result.highlights ||
  result.highlights.length === 0
) {
  const { content: _content, ...resultWithoutContent } = result;
  return resultWithoutContent as typeof result;
}

content is now always removed before a result leaves expandHighlights(), regardless of whether highlights exist.
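As a self-contained illustration (the SourceResult type below is a simplified stand-in, not the project's actual result shape), this sketch shows that the rest-destructuring strip removes the key entirely instead of leaving content: undefined behind:

// Simplified stand-in type; the real project type carries more fields.
interface SourceResult {
  title: string;
  link: string;
  content?: string;
  highlights?: { text: string }[];
}

// Mirrors the fixed early-return path: drop `content`, keep everything else.
function stripContent(result: SourceResult): Omit<SourceResult, 'content'> {
  const { content: _content, ...resultWithoutContent } = result;
  return resultWithoutContent;
}

const leaked: SourceResult = {
  title: 'Example',
  link: 'https://example.com',
  content: 'x'.repeat(200_000), // stands in for a full scraped page
  highlights: [],               // reranker returned nothing
};

const safe = stripContent(leaked);
console.log('content' in safe);           // false: the key is gone, not undefined
console.log(JSON.stringify(safe).length); // small: metadata only, no page body

Because the key is gone entirely, the serialized tool output can no longer carry the page body for these sources.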

Second fix: content passed to the reranker was not capped

While chasing the overflow, I found a second problem that became visible once the early-return content leak was closed.

With the first fix in place, sources that do have highlights now correctly delete content after expansion. But content was still being stored uncapped on the ScrapeResult object before ever reaching expandHighlights. A large page of around 80k or 200k chars, or more, was stored in full, and expandHighlights then ran content.indexOf(highlight.text) and boundary searches across the entire string. More critically, getHighlights() was also receiving the full uncapped content and handing it to RecursiveCharacterTextSplitter, which would happily produce hundreds of small chunks and send all of them to the reranker in a single request.
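For context, a Jina-compatible rerank request carries the query plus every chunk as a document, so chunk count translates directly into request size. A rough sketch of such a call (the endpoint URL and model name here are assumptions about a self-hosted deployment, not the PR's actual configuration):

// Sketch of a Jina-compatible rerank request; the endpoint and model are
// placeholders for a self-hosted deployment.
async function rerank(query: string, chunks: string[]) {
  const response = await fetch('http://localhost:8080/v1/rerank', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'jina-reranker-v2-base-multilingual',
      query,
      documents: chunks, // hundreds of entries when page content is uncapped
      top_n: 10,
    }),
  });
  return response.json();
}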

The fix is in three places:

At storage time (search.ts scrapeMany): after cleaning the text, the content is capped before being stored on the result object:

// Before
content: chunker.cleanText(content),

// After
const cleanedContent = chunker.cleanText(content);
// ...
content: cleanedContent.length > maxContentLength
  ? cleanedContent.slice(0, maxContentLength)
  : cleanedContent,

Before chunking (search.ts getHighlights): even if the content arrived uncapped through a different code path, it gets capped again before the text splitter runs:

const cappedContent =
  content.length > maxContentLength
    ? content.slice(0, maxContentLength)
    : content;
const documents = await chunker.splitText(cappedContent);

The cap itself defaults to 50,000 chars, overridable per deployment via the SEARCH_MAX_CONTENT_LENGTH environment variable. It is wired through ProcessSourcesConfig.maxContentLength -> createSourceProcessor -> both call sites above, so createSearchTool callers can tune it without touching the source.
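End to end, that wiring can look roughly like the following (a minimal sketch: the config and processor shapes are simplified, and resolveMaxContentLength is a hypothetical helper, not the PR's actual code):

// Hypothetical helper; the real default and plumbing live in the agents package.
const DEFAULT_MAX_CONTENT_LENGTH = 50_000;

function resolveMaxContentLength(): number {
  const raw = process.env.SEARCH_MAX_CONTENT_LENGTH;
  const parsed = raw != null ? Number.parseInt(raw, 10) : NaN;
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_MAX_CONTENT_LENGTH;
}

// Simplified config shape; one resolved value feeds both capping call sites.
interface ProcessSourcesConfig {
  maxContentLength?: number;
}

function createSourceProcessor(config: ProcessSourcesConfig = {}) {
  const maxContentLength = config.maxContentLength ?? resolveMaxContentLength();
  const cap = (text: string): string =>
    text.length > maxContentLength ? text.slice(0, maxContentLength) : text;
  return { cap, maxContentLength };
}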

Why 50,000?

With the default chunkSize=150 and chunkOverlap=50, a 50,000-char page produces roughly 330 chunks. That is already a lot for a reranker to score. In practice, clean Firecrawl markdown from a well-structured page is 5–40k chars, so the cap rarely fires, but it prevents a degenerate case (minified content, dense tables, very long articles) from saturating the reranker or causing latency spikes.
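To sanity-check that chunk estimate for your own content, you can run the same splitter class the pipeline uses (a quick measurement sketch, assuming LangChain's @langchain/textsplitters package; the exact count depends on where separators fall in the text):

import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// Counts how many chunks a page produces with the pipeline's default
// parameters (chunkSize=150, chunkOverlap=50).
async function countChunks(page: string): Promise<number> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 150,
    chunkOverlap: 50,
  });
  const chunks = await splitter.splitText(page);
  return chunks.length;
}

// Example: a 50,000-char prose-like page at the cap.
countChunks('lorem ipsum dolor sit amet. '.repeat(2_000).slice(0, 50_000)).then(
  (n) => console.log(`chunks sent to the reranker: ${n}`),
);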

Third fix: reducing SearXNG news results, similar to Serper

After the first two fixes brought output sizes down to manageable levels, I noticed that SearXNG, when news results are enabled, sends far too many (duplicated) news items.

executeParallelSearches in tool.ts fires a main SearXNG search alongside an optional parallel news sub-search. Both return news items that get merged into topStories. On top of that, SearXNG's general search also includes its own news category, which is converted into topStories as well. The result entering processSources can therefore carry 24–37 topStories items instead of the expected 5.

This is a SearXNG-specific issue. The Serper adapter already slices topStories to 5 before the data leaves createSerperAPI:

// Serper is already capped at source
const topStories = newsResults.slice(0, 5);

SearXNG has no equivalent cap, so all 24–37 items were scraped, reranked, and passed to formatResultsForLLM. In practice this produced n=32 sources entering expandHighlights and an output of 114,771 chars / ~28,693 tokens, which is enough to overflow smaller context windows and leave almost no room for conversation history or a model response.

The fix (search.ts processSources): slice topStories to numElements after updateSourcesWithContent, consistent with how organic results are already limited via the target parameter in fetchContents. The slice happens after enrichment so SearXNG's relevance ordering is preserved:

if (news && topStories.length > 0) {
  updateSourcesWithContent(topStories, sourceMap);
  result.data.topStories = topStories.slice(0, numElements); // ← added
}

After this fix, the same query dropped from n=32 to n=13 (5 organic + up to 5 capped topStories + a small number from the main search's own news field) and output fell to ~35,687 chars / ~8,922 tokens.

Infrastructure and codebase:

I tested this with LibreChat 0.8.3rc1 and agents 3.1.52 (with final tests on 3.1.53 and 3.1.54) against a self-hosted instance using:

  • Search: SearXNG (self-hosted)
  • Scraper: a heavily updated and optimized Firecrawl-simple (self-hosted, v1 API; release coming soon, already in my GitHub repos, but I needed LibreChat to work first)
  • Reranker: open-reranker (self-hosted, Jina-compatible API; release coming soon, but I needed LibreChat to work first)
  • Model: qwen3-14b (41,000 token KV-cache context window)

Before the fixes

With default expandHighlights parameters (mainExpandBy=300, separatorExpandBy=150), every web search query overflowed the context window:

[formatResultsForLLM] output size: 3,204,513 chars, ~801,128 tokens
-> empty_messages: Message pruning removed all messages as none fit in the context window
[formatResultsForLLM] output size: 2,174,069 chars, ~543,517 tokens
-> empty_messages: Message pruning removed all messages as none fit in the context window

Per-source JSON size inspection confirmed sources were carrying full page content: individual source sizes in the hundreds of thousands of characters.

After the fixes

With identical parameters and the same queries, output sizes dropped to normal:

[formatResultsForLLM] output size: 22,799 chars, ~5,700 tokens    ← broad navigational query
[formatResultsForLLM] output size: 115,161 chars, ~28,790 tokens  ← specific factual query

Per-source JSON sizes after the fixes: all sources between 176 and 655 chars (title + URL + snippet + highlights only; no content field). Both queries completed successfully, and the model produced correct, well-cited answers.

Why the first bug was challenging to spot

The bug was masked during an earlier investigation where I tried reducing the expansion parameters (mainExpandBy=150, separatorExpandBy=75). With smaller expansion, more sources successfully produced highlights and hit the normal code path (which correctly deletes content). Only when I restored the original default values did the early-return path become dominant, exposing the leak.

A parameter tuning session that appeared to "fix" the overflow by halving the expansion values was in fact masking this underlying bug. The actual fix is a 2-line change independent of any expansion parameter values.

Testing

I verified end-to-end on:

  • Broad navigational query ("search the web and tell me what are todays news from heise.de newsticker?"):
    pipeline completes, output ~5,700 tokens, model receives title/snippet/URL metadata only for sources without highlights (correct behavior)

  • Specific factual query ("Was ist mit dem Entwickler von Moltbot geschehen?", i.e. "What happened to the developer of Moltbot?"):
    pipeline completes, output ~28,790 tokens (within 41k context window), model produces accurate answer with correct source citations, name history (ClawdBot -> Moltbot -> OpenClaw), security context, and project future

No regressions were observed in sources that do have highlights. The normal expansion and content-deletion path is unchanged. With the content cap in place, reranker request sizes stay bounded regardless of page length.

dan-and and others added 4 commits March 1, 2026 02:13
…ighlights

Sources that had no reranker highlights were returned unchanged from
expandHighlights(), leaking full scraped page content into the LLM
tool output and overflowing the context window.

Prevents the reranker from receiving hundreds of chunks from large pages
by capping content at 50k chars (SEARCH_MAX_CONTENT_LENGTH env override)
both at storage time on ScrapeResult and before passing to the text splitter.

When using SearXNG, executeParallelSearches merges the main search
news results and an optional parallel news sub-search into topStories,
producing 24-37 items with no upper bound. Unlike Serper, which already
slices topStories to 5 before returning, SearXNG accumulates items
unboundedly through the merge in tool.ts.

Without this cap, all 24-37 sources are scraped, reranked, and passed
to formatResultsForLLM — observed as n=32 in expandHighlights and
114k chars / ~28k tokens of output, causing context window overflow.

Slice to numElements (default 5) after updateSourcesWithContent,
consistent with how organic results are limited via the target
parameter in fetchContents. The slice happens after enrichment so
relevance ordering from SearXNG is preserved.
@danny-avila (Owner) commented

Hi Dan, thanks for the PR! Makes a lot of sense, and doesn't seem to require a lot of changes. Should be a quick review

@rio100 commented Mar 2, 2026

The idea is to help local models, which often have limitations compared to K2.5, GLM-5, and DeepSeek-V3.2; I'm hoping this PR does not chop down the WebSearch capability of these models.

I get the impression that a throttle setting to address model context constraints could help here.

In my setup I host Firecrawl, SearXNG, and my own reranker-proxy locally. I've been using it for months, including with lc-8.3-rc1, without issue. My reranker-proxy massages the current format and sends it to DeepInfra's reranker. I have implemented the following firecrawlOptions:

  firecrawlOptions:
    formats:
      - "markdown"
    includeTags: ["main", "article", ".content"]
    excludeTags: ["nav", "footer", ".ads"]

    # waitFor: 23000
    # timeout: 27000
    mobile: false
    blockAds: true
    onlyMainContent: false
    location:
      country: "US"
      languages: ["en"]

I have allocated over 24 GB to the Docker services performing the WebSearch task, drawing on many search engines.

Apologies if I misunderstood the intention of this PR. It just feels like it could constrain something that is working right for my WebSearch use-case.

@dan-and (Contributor, Author) commented Mar 2, 2026

Hi rio100. It isn't chopping down anything useful. The web_search tool works great with any model of 32,000 ctx-size and larger with this solution. As you can see, I saw, for example, 4-5x the same news information flowing into the model, overflowing even 500,000 tokens. I am happy to share my solution with a self-hosted reranker, SearXNG, and Firecrawl; we can compare them and find solid common ground.

@rio100 commented Mar 2, 2026

If it doesn't constrain more capable models from chewing through relevant info to any degree, I'm all for it. So far I haven't hit any limitation with the current setup, including when I increase the Firecrawl pool settings to grab more info, at least from what I'm seeing. I like seeing 50+ sources. You know Perplexity stalls you at a much lower number with model entrapment, unless you pay dearly for subpar quality compared to what I get.

I believe there is a need to ensure smaller models running on ollama or vllm can perform WebSearch without issue; we just have to be careful not to stifle use-cases that larger models can handle.

In the EU, the AI Act requires a lot of privacy measures, often forcing companies to adopt a local, air-gapped solution. For the smallest business operations, a local model option that can run on ollama or vllm should be a targeted design goal for LC's WebSearch feature, but not at the cost of breaking what already works.


@rio100
Copy link
Copy Markdown

rio100 commented Mar 2, 2026

Made some edits to my previous post. Hoping it's more legible. Need more coffee, it's the morning here. :)
