fix: reranked websearch content doesn't overload the model anymore #63
dan-and wants to merge 4 commits into danny-avila:main from
Conversation
Sources that had no reranker highlights were returned unchanged from `expandHighlights()`, leaking full scraped page content into the LLM tool output and overflowing the context window.
Prevents the reranker from receiving hundreds of chunks from large pages by capping content at 50k chars (`SEARCH_MAX_CONTENT_LENGTH` env override), both at storage time on `ScrapeResult` and before passing to the text splitter.
When using SearXNG, `executeParallelSearches` merges the main search's news results and an optional parallel news sub-search into `topStories`, producing 24–37 items with no upper bound. Unlike Serper, which already slices `topStories` to 5 before returning, SearXNG accumulates items unboundedly through the merge in `tool.ts`. Without this cap, all 24–37 sources are scraped, reranked, and passed to `formatResultsForLLM`, observed as `n=32` in `expandHighlights` and 114k chars / ~28k tokens of output, causing context window overflow. Slice to `numElements` (default 5) after `updateSourcesWithContent`, consistent with how organic results are limited via the `target` parameter in `fetchContents`. The slice happens after enrichment so relevance ordering from SearXNG is preserved.
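A minimal sketch of the cap described above, with a hypothetical item type and helper name (the real change lives in `processSources` in `search.ts` and operates on richer objects):

```typescript
// Illustrative shape only; the actual topStories entries carry more fields.
type TopStory = { title: string; url: string; snippet?: string };

// Sketch: cap the merged SearXNG topStories AFTER enrichment, so the
// relevance ordering coming back from SearXNG is preserved. This mirrors
// Serper's existing slice-to-5 behaviour.
function capTopStories(topStories: TopStory[], numElements = 5): TopStory[] {
  return topStories.slice(0, numElements);
}
```

Slicing after enrichment (rather than before scraping) keeps the top-ranked items; the earlier the slice, the cheaper the pipeline, but the PR deliberately trades that for preserved ordering.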
Hi Dan, thanks for the PR! Makes a lot of sense, and doesn't seem to require a lot of changes. Should be a quick review.
The idea is to help local models, which often have limitations compared to K2.5, GLM-5, and DeepSeek-V3.2; I'm hoping this PR does not chop down the capability of these models with WebSearch. I get the impression a throttle setting to address model context constraints could help here. In my setup I host firecrawl, searx, and my own reranker-proxy locally. I've been using it for months, including lc-8.3-rc1, without issue. My reranker-proxy massages the current format and sends it to DeepInfra's reranker. I have implemented the following as firecrawl options: I have allocated over 24GB to Docker services to perform the WebSearch task, drawing on many search engines. Apologies if I misunderstood the intention of this PR; I just feel it could constrain something that's working well for my WebSearch use-case.
Hi rio100. It isn't chopping down anything useful. The web_search tool is pretty great with any model of 32,000 ctx-size and larger with this solution. As you can see, I saw, for example, 4-5x the same news information flowing into the model and overflowing even 500,000 tokens. I am happy to share my solution with a self-hosted reranker, SearXNG, and Firecrawl; we can compare them and find good solid ground.
If it doesn't constrain more capable models in chewing through relevant info to any degree, I'm all for it. So far I haven't hit any limitation with the current setup, including when I increase firecrawl pool settings to grab more info, at least from what I'm seeing. I like seeing 50+ sources. You know Perplexity stalls you at a much lower number with model entrapment, unless you pay dearly for subpar quality compared to what I get. I believe there is a need to ensure smaller models running on Ollama or vLLM can perform WebSearch without issue; we just have to be careful not to stifle use-cases that larger models can handle. In the EU, the AI Act requires a lot of privacy measures, often forcing companies to adopt a local air-gapped solution. And for the smallest business operations, a local model option that can run on Ollama or vLLM should be a targeted design for LC's WebSearch feature, but not at the cost of breaking what already works.
Made some edits to my previous post. Hoping it's more legible. Need more coffee; it's the morning here. :)
Hi Danny,
I was searching for why my local websearch, webcrawler, and reranker always killed my model context, which led to completely lobotomized models as the context/kv-cache got overwritten. That was already 6 months ago (see https://www.reddit.com/r/LocalLLaMA/comments/1mucj1p/which_models_are_suitable_for_websearch/ )
I had to dig deep to figure out what is going on in the complex of LibreChat -> agents and the flow between SearXNG, Firecrawl, Jina-reranker, and local models, which usually have a limited amount of kv-cache.
I spent the last weeks adding more and more debug code to the agents code and found the following:
First fix: `expandHighlights` sending whole content
With the default `expandHighlights` parameters (`mainExpandBy=300`, `separatorExpandBy=150`), every web search query overflowed the context window. Per-source JSON size inspection confirmed sources were carrying full page content: individual source sizes in the hundreds of thousands of characters.

`expandHighlights()` had an early-return path for sources with no highlights that returned the original result object unmodified, including the full raw scraped page content from Firecrawl. This caused the entire scraped content of those pages to pass through into `formatResultsForLLM` and be sent to the LLM as part of the tool output, producing context window overflows on every web search query.

What I found:
In `src/tools/search/highlights.ts`, the `expandHighlights()` function maps over organic results and top stories to expand highlight boundaries. For sources that had no highlights (either because the reranker returned nothing, the source failed to scrape, or content was empty), the function returned early:
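A sketch of the problematic shape, using a simplified result type (names and fields here are illustrative; the actual implementation in `highlights.ts` differs):

```typescript
// Simplified, hypothetical result shape for illustration.
type SearchResult = {
  title: string;
  url: string;
  content?: string; // full scraped markdown from Firecrawl
  highlights?: { text: string }[];
};

// Sketch of the PRE-fix behaviour: sources without highlights were
// returned as-is, so the full `content` leaked into the tool output.
function expandHighlightsPreFix(result: SearchResult): SearchResult {
  if (!result.highlights || result.highlights.length === 0) {
    return result; // early return: `content` is still attached
  }
  const expanded = { ...result };
  // ...expand highlight boundaries on the copy...
  delete expanded.content; // normal path strips content
  return expanded;
}
```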
The normal code path (when highlights exist) correctly creates a shallow copy and deletes `content`:
But the early-return path bypassed this entirely, passing `result.content`, the full scraped markdown from Firecrawl (potentially hundreds of kilobytes per page), directly into the output.

The Change
Strip `content` from the result in the early-return path before returning: `content` is now always removed before a result leaves `expandHighlights()`, regardless of whether highlights exist.
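The change can be sketched like this, again with a simplified illustrative type rather than the literal diff:

```typescript
// Simplified, hypothetical result shape for illustration.
type SearchResult = {
  title: string;
  url: string;
  content?: string; // full scraped markdown from Firecrawl
  highlights?: { text: string }[];
};

function expandHighlightsFixed(result: SearchResult): SearchResult {
  if (!result.highlights || result.highlights.length === 0) {
    const stripped = { ...result };
    delete stripped.content; // fix: strip content on the early-return path too
    return stripped;
  }
  const expanded = { ...result };
  // ...expand highlight boundaries on the copy...
  delete expanded.content; // unchanged normal path
  return expanded;
}
```

Either way a result leaves the function, `content` is gone; only title, URL, snippet, and highlights survive into the tool output.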
Second fix: content passed to the reranker was not capped
While chasing the overflow, I found a second problem that became visible once the early-return content leak was closed.
With the first fix in place, sources that do have highlights now correctly delete `content` after expansion. But `content` was still being stored uncapped on the `ScrapeResult` object before ever reaching `expandHighlights`. A large page of around 80k, 200k, or more chars was stored in full; `expandHighlights` then ran `content.indexOf(highlight.text)` and boundary searches across the entire string. More critically, `getHighlights()` was also receiving the full uncapped content and handing it to `RecursiveCharacterTextSplitter`, which would happily produce hundreds of small chunks and send all of them to the reranker in a single request.

The fix is in three places:
At storage time (`search.ts`, `scrapeMany`): after cleaning the text, the content is capped before being stored on the result object.

Before chunking (`search.ts`, `getHighlights`): even if the content arrived uncapped through a different code path, it gets capped again before the text splitter runs.

The cap itself is set to 50,000 chars by default, overridable per deployment via the `SEARCH_MAX_CONTENT_LENGTH` environment variable. It is wired through `ProcessSourcesConfig.maxContentLength` -> `createSourceProcessor` -> both call sites above, so `createSearchTool` callers can tune it without touching the source.

Why 50000?
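A sketch of the cap and its env-var resolution under the assumptions stated above (helper names here are hypothetical; only `SEARCH_MAX_CONTENT_LENGTH` and the 50,000-char default come from the PR):

```typescript
const DEFAULT_MAX_CONTENT_LENGTH = 50_000;

// Resolve the effective cap: an explicit config value wins, then the
// SEARCH_MAX_CONTENT_LENGTH env override, then the 50k default.
function resolveMaxContentLength(configured?: number): number {
  const fromEnv = Number(process.env.SEARCH_MAX_CONTENT_LENGTH);
  return (
    configured ??
    (Number.isFinite(fromEnv) && fromEnv > 0 ? fromEnv : DEFAULT_MAX_CONTENT_LENGTH)
  );
}

// Applied both at storage time (scrapeMany) and before chunking
// (getHighlights), per the description above.
function capContent(content: string, maxLen: number): string {
  return content.length > maxLen ? content.slice(0, maxLen) : content;
}
```

Capping in both places is deliberately redundant: even if a future code path stores uncapped content, the splitter never sees more than the cap.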
With the default `chunkSize=150` and `chunkOverlap=50`, a 50,000-char page produces roughly 330 chunks, which is already a lot for a reranker to score. In practice, clean Firecrawl markdown from a well-structured page is 5–40k chars, so the cap rarely fires, but it prevents a degenerate case (minified content, dense tables, very long articles) from saturating the reranker or causing latency spikes.

Third fix: reducing news for SearXNG similar to Serper
After the first two fixes brought output sizes down to manageable levels, I found that SearXNG, when news results are enabled, sends far too many (duplicated) news items.

`executeParallelSearches` in `tool.ts` fires a main SearXNG search alongside an optional parallel news sub-search. Both return news items that are merged into `topStories`. SearXNG's general search also includes its own `news` category, which gets converted to `topStories` as well. The result entering `processSources` can carry 24–37 `topStories` items instead of the expected 5.

This is a SearXNG-specific issue. The Serper adapter already slices `topStories` to 5 before the data leaves `createSerperAPI`. SearXNG has no equivalent cap, so all 24–37 items were scraped, reranked, and passed to `formatResultsForLLM`. In practice this produced `n=32` sources entering `expandHighlights` and output of 114,771 chars / ~28,693 tokens, which is enough to overflow smaller context windows and leave almost no room for conversation history or a model response.
The fix (`search.ts`, `processSources`): slice `topStories` to `numElements` after `updateSourcesWithContent`, consistent with how organic results are already limited via the `target` parameter in `fetchContents`. The slice happens after enrichment so SearXNG's relevance ordering is preserved.

After this fix, the same query dropped from `n=32` to `n=13` (5 organic + up to 5 capped topStories + a small number from the main search's own news field) and output fell to ~35,687 chars / ~8,922 tokens.

Infrastructure and codebase:
I tested this with LibreChat 0.8.3-rc1 and agents 3.1.52 (and last tests with 3.1.53 and 3.1.54) against a self-hosted instance using:
Before the fixes
With the default `expandHighlights` parameters (`mainExpandBy=300`, `separatorExpandBy=150`), every web search query overflowed the context window. Per-source JSON size inspection confirmed sources were carrying full page content: individual source sizes in the hundreds of thousands of characters.
After the fixes
With identical parameters and the same queries, output sizes dropped to normal:
Per-source JSON sizes after the fixes: all sources between 176–655 chars (title + URL + snippet + highlights only, no content field). Both queries completed successfully, and the model produced correct, well-cited answers.
Why the first bug was challenging to spot
The bug was masked during an earlier investigation in which I tried reducing the expansion parameters (`mainExpandBy=150`, `separatorExpandBy=75`). With smaller expansion, more sources successfully produced highlights and hit the normal code path (which correctly deletes content). Only when I restored the original default values did the early-return path become dominant, exposing the leak.

A parameter-tuning session that appeared to "fix" the overflow by halving the expansion values was in fact masking this underlying bug. The actual fix is a 2-line change independent of any expansion parameter values.
Testing
I verified end-to-end on:
Broad navigational query ("search the web and tell me what are today's news from heise.de newsticker?"):
pipeline completes, output ~5,700 tokens, model receives title/snippet/URL metadata only for sources without highlights (correct behavior)
Specific factual query ("Was ist mit dem Entwickler von Moltbot geschehen?", i.e. "What happened to the developer of Moltbot?"):
pipeline completes, output ~28,790 tokens (within 41k context window), model produces accurate answer with correct source citations, name history (ClawdBot -> Moltbot -> OpenClaw), security context, and project future
No regressions were observed in sources that do have highlights: the normal expansion and content-deletion path is unchanged. With the content cap in place, reranker request sizes stay bounded regardless of page length.