feat: deduplication layer for scrape_batch + field filters#140
Conversation
…arch_engine_batch SECURITY FIXES: - Add prototype pollution protection in filterFields() - Block __proto__, constructor, prototype properties - Sanitize error messages to prevent information disclosure - No hardcoded API keys in any file FUNCTIONALITY: - Add deduplication layer to scrape_batch tool - Add field filtering to search_engine_batch tool - Remove duplicate content blocks across URLs - Include metrics option for dedup stats TEST FILES: - test_context_cache.js: 9 tests - test_dedup_edge_cases.js: 8 tests - test_filter_fields.js: 20 tests Total: 37 tests passing
|
Thank you for the contribution, this is a thoughtful PR and the token-cost problem you're targeting is a real one. I went through it carefully and ran the tests locally (37/37 passing). I have some feedback below, mostly on the dedup layer. The field-filtering side I think is in good shape. The dedup half and the field-filter half read a bit like two separate PRs, and they land differently for me. The field filtering is solid. The dedup layer is where I'd like to talk through a few things. Concerns with the dedup layer1. It doesn't quite remove what the description says it removes. The stated problem is shared nav/header/footer across pages on one domain. But the fingerprint is 2. Sampled hashing can drop real data. If two genuinely different pages happen to share the first 2048 chars, a 256-char window at the midpoint, and the last 256 chars, the second one gets flagged duplicate and its 3. 4. Suggestions
Really appreciate the effort here, especially the test coverage and the prototype-pollution guard in |
|
Hi, thanks a lot for taking the time on this, and for actually running the tests, that means a lot! You're right on all four, and the first one is the one that stings because it's true: the fingerprint hashes a sampled window over the whole page, so two articles that only share the chrome get different hashes and nothing collapses. The boilerplate I described is exactly what doesn't get removed.... I also have to own something, the PR notes say we "switched to full-content SHA-256." That's not what's in this branch, this branch still samples. We moved to full-content hashing in another codebase and the note got ahead of the actual diff here. My mistake, soryabout that. Here's what I'd like to do:
If the split PR looks good I'll close this one. Really appreciate you engaging this deeply — this is the kind of review you learn from!!! |
Summary
Deduplication layer for batch scraping that removes duplicate content blocks across URLs, reducing token usage in LLM pipelines.
Problem
When scraping multiple URLs from the same domain, pages share nav/header/footer HTML. Without dedup, identical content is returned N times, wasting tokens and increasing costs.
Solution
scrape_batch new parameters
deduplicatetrueinclude_metricsfalse{results: [...], metrics: {...}}responsefieldsformatmarkdownmarkdown(default) orrawsearch_engine_batch
fieldsresult.organicarray to requested keys (link, title, description, relevance_score, cursor)Hash Algorithm
This captures shared headers/footers while correctly distinguishing pages with same structure but different body content.
Test Suite
Backward Compatibility
Default behavior (no params or
include_metrics: false) returns flat array — no breaking changes to existing consumers.Prior Art
Deduplication strategy inspired by ContextForge (https://github.com/SuarezPM/Apohara_Context_Forge, DOI: 10.5281/zenodo.20277875).