feat: Add tokenizer-based text chunking for NER recognizers #2041
yuriihavrylko wants to merge 16 commits
…r tokenizer support
… text chunking strategies
…nizer-based approach
Pull request overview
Adds configurable tokenizer-aware text chunking for NER recognizers, improving long-text handling for HuggingFace and GLiNER based recognition.
Changes:
- Introduces `TokenizerBasedTextChunker` and lazy exports/imports to keep `transformers` optional.
- Allows `GLiNERRecognizer` and `HuggingFaceNerRecognizer` to accept YAML-style `text_chunker` dict configs.
- Adds tests and documentation for chunker configuration and GLiNER chunking usage.
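For readers new to the idea, tokenizer-aware chunking can be sketched roughly as below. This is not the PR's `TokenizerBasedTextChunker`: the function names, the `overlap_tokens` parameter, and the whitespace "tokenizer" are assumptions made for illustration, standing in for a real model tokenizer that reports character offsets.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]


def whitespace_offsets(text: str) -> List[Span]:
    """Hypothetical stand-in for a model tokenizer: one (start, end) span per word."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    overlap_tokens: int = 0,
    offsets_fn: Callable[[str], List[Span]] = whitespace_offsets,
) -> List[Span]:
    """Return character spans, each covering at most `max_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    offsets = offsets_fn(text)
    step = max_tokens - overlap_tokens
    chunks: List[Span] = []
    i = 0
    while i < len(offsets):
        window = offsets[i : i + max_tokens]
        # A chunk runs from the first token's start to the last token's end.
        chunks.append((window[0][0], window[-1][1]))
        if i + max_tokens >= len(offsets):
            break  # the last window already reached the end of the text
        i += step
    return chunks
```

Returning character spans rather than token lists keeps entity offsets found inside a chunk easy to map back to the original text.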
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `presidio-analyzer/presidio_analyzer/chunkers/tokenizer_based_text_chunker.py` | Adds tokenizer-based chunking implementation. |
| `presidio-analyzer/presidio_analyzer/chunkers/text_chunker_provider.py` | Adds provider support for tokenizer chunker type. |
| `presidio-analyzer/presidio_analyzer/chunkers/__init__.py` | Exposes tokenizer chunker through lazy import. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py` | Adds configurable chunker support and chunk size parameters. |
| `presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py` | Adds dict-based chunker config support. |
| `presidio-analyzer/tests/test_tokenizer_based_text_chunker.py` | Adds unit tests for tokenizer chunking behavior. |
| `presidio-analyzer/tests/test_gliner_recognizer.py` | Adds GLiNER chunker configuration tests. |
| `presidio-analyzer/tests/test_huggingface_ner_recognizer.py` | Adds HuggingFace recognizer chunker configuration tests. |
| `docs/samples/python/gliner.md` | Documents GLiNER text chunking configuration. |
| `docs/analyzer/recognizer_registry_provider.md` | Documents YAML text_chunker configuration. |
omri374 left a comment
Thanks! Great addition!
Please make sure that the YAML flow is properly tested
      threshold: float = 0.30,
      map_location: Optional[str] = None,
    - text_chunker: Optional[BaseTextChunker] = None,
    + text_chunker: Optional[Union[BaseTextChunker, Dict[str, Any]]] = None,
In which scenario would the user pass a dict here? I think it's better to ask the user to pass the chunker class, and instantiate it using the dict prior to calling the recognizer.
The dict path is used when loading from a YAML config. The recognizer registry passes YAML fields as kwargs directly to the constructor, so `text_chunker` arrives as a raw dict like `{"chunker_type": "tokenizer", "tokenizer": "model-name", "max_tokens": 512}`.
Without dict support here, users would need custom Python code to instantiate the chunker before passing it, which defeats the purpose of YAML-based configuration.
The alternative would be handling the dict-to-object conversion in the registry loader, but that would require the loader to know about chunker-specific logic. Keeping it in the recognizer felt more self-contained.
We have a Pydantic validation layer between YAML and the actual Presidio classes to handle configuration errors more gracefully. I don't see a reason not to use it here too and avoid generic dicts as input. Please take a look and see if there's a reason for it not to apply here. Thanks!
Done. Moved the dict-to-object conversion into the Pydantic validation layer (`TextChunkerConfig` in `yaml_recognizer_models.py`) and the registry loader. Recognizer constructors now only accept `Optional[BaseTextChunker]`; no more dicts.
…rs in recognizer registry
…t in TextChunkerProvider
…enizerBasedTextChunker
…and HuggingFace recognizers
Looking at the solution, I noticed that requiring users to specify a tokenizer each time they want to use the tokenizer-based chunker isn't ideal: in most cases it will be the same tokenizer as the model itself. Specifying it separately is redundant and error-prone (risk of a mismatch between the chunker tokenizer and the model tokenizer). For GLiNER models it's not even possible, since their repos can't be loaded by …

So I made the tokenizer optional and added a deferred resolution pattern: the chunker is created without a tokenizer from the YAML config, and the recognizer automatically resolves it at model-load time using the model's own tokenizer. This way users just configure chunk sizes in YAML and the right tokenizer is picked up automatically. Less config, fewer mistakes.
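The deferred resolution pattern described in this comment could look roughly like the sketch below. The class and method names (`DeferredTokenChunker`, `resolve_tokenizer`, `FakeRecognizer`) are hypothetical and are not taken from the PR; `str.split` stands in for the model's real tokenizer.

```python
from typing import Callable, List, Optional

Tokenize = Callable[[str], List[str]]


class DeferredTokenChunker:
    """Hypothetical chunker whose tokenizer may be supplied after construction."""

    def __init__(self, max_tokens: int, tokenizer: Optional[Tokenize] = None):
        self.max_tokens = max_tokens
        self.tokenizer = tokenizer

    @property
    def needs_tokenizer(self) -> bool:
        return self.tokenizer is None

    def resolve_tokenizer(self, tokenizer: Tokenize) -> None:
        # Only fill in a missing tokenizer; an explicitly configured one wins.
        if self.tokenizer is None:
            self.tokenizer = tokenizer

    def chunk(self, text: str) -> List[List[str]]:
        if self.tokenizer is None:
            raise RuntimeError("tokenizer has not been resolved yet")
        tokens = self.tokenizer(text)
        return [
            tokens[i : i + self.max_tokens]
            for i in range(0, len(tokens), self.max_tokens)
        ]


class FakeRecognizer:
    """Hypothetical recognizer that hands its own tokenizer to the chunker on load."""

    def __init__(self, chunker: DeferredTokenChunker):
        self.chunker = chunker
        self.model_tokenizer: Optional[Tokenize] = None

    def load(self) -> None:
        # Stand-in for loading the model; str.split plays the model tokenizer.
        self.model_tokenizer = str.split
        self.chunker.resolve_tokenizer(self.model_tokenizer)
```

Creating the chunker from config stays cheap because nothing is downloaded until `load()` runs, and the chunker cannot drift from the model's tokenizer unless one is set explicitly.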
    model_config = ConfigDict(extra="forbid", arbitrary_types_allowed=True)

    chunker_type: Literal["character", "tokenizer"] = Field(
        ..., description="Type of chunker"
    )
    chunk_size: Optional[int] = Field(None, description="Character chunk size")
    chunk_overlap: Optional[int] = Field(None, description="Character overlap")
Intentional: `chunker_type` is required to keep YAML configs explicit. If a user writes `text_chunker: {}`, that's likely a mistake and should be caught by validation, not silently defaulted to character chunking. The recognizer already defaults to `CharacterBasedTextChunker` when `text_chunker` is omitted entirely.
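To make the intended behavior concrete, here is a standard-library sketch of the same two rules: a required `chunker_type` and rejection of unknown keys (mirroring `extra="forbid"`). It deliberately does not use Pydantic, and `validate_chunker_config` is a hypothetical helper. The allowed keys are only those visible in the snippet above, so the real model may accept more.

```python
from typing import Any, Dict

# Values and keys copied from the snippet above; the real model may define more.
ALLOWED_TYPES = {"character", "tokenizer"}
ALLOWED_KEYS = {"chunker_type", "chunk_size", "chunk_overlap"}


def validate_chunker_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical validator: fail loudly instead of silently defaulting."""
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    if "chunker_type" not in raw:
        # An empty `text_chunker: {}` lands here rather than becoming a default.
        raise ValueError("chunker_type is required")
    if raw["chunker_type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown chunker_type: {raw['chunker_type']!r}")
    return dict(raw)
```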
    # 2. Convert text_chunker dict to BaseTextChunker instance
    if "text_chunker" in kwargs and isinstance(kwargs["text_chunker"], dict):
        from presidio_analyzer.chunkers import TextChunkerProvider

        # Strip None values that may leak from Pydantic model_dump
        chunker_config = {
            k: v for k, v in kwargs["text_chunker"].items() if v is not None
        }
        kwargs["text_chunker"] = TextChunkerProvider(
            chunker_config
        ).create_chunker()
Not a concern in practice: `text_chunker` only appears in configs for recognizers that accept it. And with deferred tokenizer mode, no heavy work happens here (no model or tokenizer download); only a lightweight placeholder object is created.
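As a side note on the `None`-stripping step in the loader snippet, the small sketch below shows why it matters: an optional field dumped as `None` would otherwise override a constructor default. `make_chunker_settings` and `strip_none` are hypothetical helpers, and the 250/50 defaults are borrowed from the GLiNER values quoted in the PR description.

```python
from typing import Any, Dict


def make_chunker_settings(chunk_size: int = 250, chunk_overlap: int = 50) -> Dict[str, int]:
    """Hypothetical factory standing in for a chunker constructor with defaults."""
    return {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap}


def strip_none(config: Dict[str, Any]) -> Dict[str, Any]:
    """Drop keys whose value is None so that downstream defaults still apply."""
    return {k: v for k, v in config.items() if v is not None}


# A dumped config where the user set only chunk_size.
dumped = {"chunk_size": 500, "chunk_overlap": None}

unfiltered = make_chunker_settings(**dumped)            # None replaces the default
filtered = make_chunker_settings(**strip_none(dumped))  # default overlap survives
```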
…rove tokenizer description
Change Description
Add `TokenizerBasedTextChunker` that uses HuggingFace tokenizers for accurate token-based text splitting, and make text chunking configurable from YAML for both `GLiNERRecognizer` and `HuggingFaceNerRecognizer`.
What's new
- `TokenizerBasedTextChunker`: splits text by token count using the model's actual tokenizer, respecting its token limit
- `text_chunker` config: the `text_chunker` parameter on NER recognizers now accepts a dict (from YAML) in addition to Python objects, resolved via `TextChunkerProvider`
- `chunk_size`/`chunk_overlap` params added to `GLiNERRecognizer` (`HuggingFaceRecognizer` already had them)
YAML example
Why it matters
deployment-only workflows
`GLiNERRecognizer` had no chunking configurability: chunk size and overlap were hardcoded (250 chars / 50 overlap) with no way to adjust them from config or constructor params. Now both the `chunk_size`/`chunk_overlap` params and the full `text_chunker` config are supported.
Issue reference
No linked issue
Checklist