feat: Add tokenizer-based text chunking for NER recognizers #2041

Open
yuriihavrylko wants to merge 16 commits into microsoft:main from yuriihavrylko:feat/tokenizer-based-text-chunker

Contributor

@yuriihavrylko yuriihavrylko commented May 26, 2026

Change Description

Add TokenizerBasedTextChunker, which uses HuggingFace tokenizers for accurate token-based text splitting, and make text chunking configurable from YAML for both GLiNERRecognizer and HuggingFaceNerRecognizer.

What's new

  • TokenizerBasedTextChunker: splits text by token count using the model's actual tokenizer, respecting its token limit
  • YAML text_chunker config: text_chunker parameter on NER recognizers now accepts a dict (from YAML) in addition to Python objects, resolved via TextChunkerProvider
  • GLiNER parity: added missing chunk_size/chunk_overlap params to GLiNERRecognizer (HuggingFaceNerRecognizer already had them)
  • Lazy imports for transformers: the package is only required when the tokenizer chunker is actually used
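
To make the token-count splitting concrete, here is a minimal sketch of a sliding window over token offsets. It is not the PR's TokenizerBasedTextChunker: the names `toy_tokenize` and `chunk_by_tokens` are hypothetical, and a whitespace regex stands in for the model's real tokenizer so the example runs without transformers.

```python
import re
from typing import List, Tuple


def toy_tokenize(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each whitespace-delimited token.

    Assumption: a real chunker would take these offsets from the model's tokenizer.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def chunk_by_tokens(
    text: str, max_tokens: int, overlap_tokens: int
) -> List[Tuple[int, int, str]]:
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share overlap_tokens tokens. Each chunk is returned
    as (char_start, char_end, chunk_text) so entity offsets can be mapped
    back to the original text.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_tokens < max_tokens:
        raise ValueError("overlap_tokens must be in [0, max_tokens)")

    offsets = toy_tokenize(text)
    step = max_tokens - overlap_tokens
    chunks: List[Tuple[int, int, str]] = []
    start_tok = 0
    while start_tok < len(offsets):
        end_tok = min(start_tok + max_tokens, len(offsets))
        char_start = offsets[start_tok][0]
        char_end = offsets[end_tok - 1][1]
        chunks.append((char_start, char_end, text[char_start:char_end]))
        if end_tok == len(offsets):
            break  # the last window already reached the end of the text
        start_tok += step
    return chunks


chunks = chunk_by_tokens("a b c d e f g", max_tokens=3, overlap_tokens=1)
```

Keeping the character offsets alongside each chunk is the design point: a recognizer can run on the chunk text and then shift its results by `char_start` to report positions in the full input.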

YAML example

  - name: GLiNERRecognizer
    type: predefined
    text_chunker:
      chunker_type: tokenizer
      tokenizer: urchade/gliner_multi_pii-v1
      max_tokens: 512
      overlap_tokens: 32

Why it matters

  • Character-based approximation either wastes capacity (chunks fall short of the token limit) or exceeds the limit. Token-based chunking uses the exact token budget available.
  • Being configurable from YAML means users can tune the chunking strategy without writing Python code, making it accessible for deployment-only workflows.
  • GLiNERRecognizer had no chunking configurability: chunk size and overlap were hardcoded (250 chars / 50 overlap) with no way to adjust them from config or constructor params. Now both the chunk_size/chunk_overlap params and the full text_chunker config are supported.
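
For comparison, character-based chunking with the previously hardcoded values (250 chars, 50 overlap) can be sketched as below. This is a simplified illustration, not Presidio's character chunker: `chunk_by_chars` is a hypothetical name, and it cuts at exact character positions without adjusting to word boundaries.

```python
from typing import List, Tuple


def chunk_by_chars(
    text: str, chunk_size: int = 250, chunk_overlap: int = 50
) -> List[Tuple[int, str]]:
    """Split text into fixed-size character windows.

    Returns (start_offset, chunk_text) pairs. Consecutive windows overlap
    by chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[Tuple[int, str]] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, text[start:end]))
        if end == len(text):
            break  # the final window reached the end of the text
        start += step
    return chunks


char_chunks = chunk_by_chars("x" * 600)
```

A 250-character window holds a different number of model tokens depending on the text, which is why a character budget can only approximate a token limit.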

Issue reference

No linked issue

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

Contributor

Copilot AI left a comment

Pull request overview

Adds configurable tokenizer-aware text chunking for NER recognizers, improving long-text handling for HuggingFace and GLiNER based recognition.

Changes:

  • Introduces TokenizerBasedTextChunker and lazy exports/imports to keep transformers optional.
  • Allows GLiNERRecognizer and HuggingFaceNerRecognizer to accept YAML-style text_chunker dict configs.
  • Adds tests and documentation for chunker configuration and GLiNER chunking usage.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/chunkers/tokenizer_based_text_chunker.py | Adds tokenizer-based chunking implementation. |
| presidio-analyzer/presidio_analyzer/chunkers/text_chunker_provider.py | Adds provider support for tokenizer chunker type. |
| presidio-analyzer/presidio_analyzer/chunkers/__init__.py | Exposes tokenizer chunker through lazy import. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py | Adds configurable chunker support and chunk size parameters. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py | Adds dict-based chunker config support. |
| presidio-analyzer/tests/test_tokenizer_based_text_chunker.py | Adds unit tests for tokenizer chunking behavior. |
| presidio-analyzer/tests/test_gliner_recognizer.py | Adds GLiNER chunker configuration tests. |
| presidio-analyzer/tests/test_huggingface_ner_recognizer.py | Adds HuggingFace recognizer chunker configuration tests. |
| docs/samples/python/gliner.md | Documents GLiNER text chunking configuration. |
| docs/analyzer/recognizer_registry_provider.md | Documents YAML text_chunker configuration. |

Collaborator

@omri374 omri374 left a comment

Thanks! Great addition!

Please make sure that the YAML flow is properly tested

Comment thread presidio-analyzer/presidio_analyzer/chunkers/__init__.py Outdated
Comment thread docs/analyzer/recognizer_registry_provider.md
  threshold: float = 0.30,
  map_location: Optional[str] = None,
- text_chunker: Optional[BaseTextChunker] = None,
+ text_chunker: Optional[Union[BaseTextChunker, Dict[str, Any]]] = None,
Collaborator

In which scenario would the user pass a dict here? I think it's better to ask the user to pass the chunker class, and instantiate it using the dict prior to calling the recognizer.

Contributor Author

The dict path is used when loading from a YAML config. The recognizer registry passes YAML fields as kwargs directly to the constructor, so text_chunker arrives as a raw dict like {"chunker_type": "tokenizer", "tokenizer": "model-name", "max_tokens": 512}.

Without dict support here, users would need custom Python code to instantiate the chunker before passing it, which defeats the purpose of YAML-based configuration.

The alternative would be handling the dict-to-object conversion in the registry loader, but that would require the loader to know about chunker-specific logic. Keeping it in the recognizer felt more self-contained.

Collaborator

We have a Pydantic validation layer between YAML and the actual Presidio classes to handle configuration errors more gracefully. I don't see a reason not to use it here too and avoid generic dicts as input. Please take a look and see if there's a reason it shouldn't apply here. Thanks!

Contributor Author

Done. Moved the dict-to-object conversion into the Pydantic validation layer (TextChunkerConfig in yaml_recognizer_models.py) and the registry loader. Recognizer constructors now only accept Optional[BaseTextChunker], no more dicts.
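
The resulting split of responsibilities can be sketched roughly as follows. This is an illustration only, using stdlib dataclasses in place of Pydantic; `SketchChunker`, `SketchRecognizer`, and `load_recognizer` are hypothetical names, not classes from the PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchChunker:
    """Stand-in for a BaseTextChunker implementation."""

    chunker_type: str
    max_tokens: int = 512
    overlap_tokens: int = 32


class SketchRecognizer:
    """Stand-in recognizer whose constructor accepts chunker objects only."""

    def __init__(
        self, model_name: str, text_chunker: Optional[SketchChunker] = None
    ) -> None:
        if text_chunker is not None and not isinstance(text_chunker, SketchChunker):
            raise TypeError("text_chunker must be a chunker instance, not a raw dict")
        self.model_name = model_name
        self.text_chunker = text_chunker


def load_recognizer(config: Dict[str, Any]) -> SketchRecognizer:
    """Loader step: turn the YAML-derived dict into an object before construction."""
    kwargs = dict(config)  # copy so the caller's config is not mutated
    raw = kwargs.get("text_chunker")
    if isinstance(raw, dict):
        kwargs["text_chunker"] = SketchChunker(**raw)
    return SketchRecognizer(**kwargs)


recognizer = load_recognizer(
    {
        "model_name": "urchade/gliner_multi_pii-v1",
        "text_chunker": {"chunker_type": "tokenizer", "max_tokens": 256},
    }
)
```

With the conversion done in the loader, the recognizer's type signature stays narrow, and Python callers and YAML users end up on the same code path.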

Copilot AI review requested due to automatic review settings June 1, 2026 06:47
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Comment thread presidio-analyzer/presidio_analyzer/chunkers/tokenizer_based_text_chunker.py Outdated
Comment thread presidio-analyzer/tests/test_gliner_recognizer.py Outdated
Comment thread docs/samples/python/gliner.md
Copilot AI review requested due to automatic review settings June 1, 2026 21:44
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

Comment thread presidio-analyzer/presidio_analyzer/chunkers/tokenizer_based_text_chunker.py Outdated
@yuriihavrylko
Contributor Author

Looking at the solution, I noticed that requiring users to specify a tokenizer each time they want to use the tokenizer-based chunker isn't ideal: in most cases it will be the same tokenizer as the model's. Specifying it separately is redundant and error-prone (risk of a mismatch between the chunker tokenizer and the model tokenizer). For GLiNER models it isn't even possible, since their repos can't be loaded by AutoTokenizer (so users would need to use the backbone tokenizer).

So I made the tokenizer optional and added a deferred resolution pattern - the chunker is created without a tokenizer from YAML config, and the recognizer automatically resolves it at model-load time using the model's own tokenizer. This way users just configure chunk sizes in YAML and the right tokenizer is picked up automatically. Less config, fewer mistakes.

YAML example

  - name: GLiNERRecognizer
    type: predefined
    model_name: urchade/gliner_multi_pii-v1
    text_chunker:
      chunker_type: tokenizer
      # tokenizer: urchade/gliner_multi_pii-v1  # now optional; derived from the model when omitted
      max_tokens: 512
      overlap_tokens: 32
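
A minimal sketch of the deferred resolution pattern described above, under stated assumptions: `DeferredTokenChunker`, `SketchRecognizer`, `needs_tokenizer`, and `resolve_tokenizer` are hypothetical names, and a whitespace splitter stands in for the tokenizer that would come from the loaded model.

```python
from typing import Callable, List, Optional

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in for the model's real tokenizer (assumption for this sketch)."""
    return text.split()


class DeferredTokenChunker:
    """Chunker that can be created from config before any tokenizer exists."""

    def __init__(
        self,
        max_tokens: int,
        overlap_tokens: int = 0,
        tokenizer: Optional[Tokenizer] = None,
    ) -> None:
        self.max_tokens = max_tokens
        self.overlap_tokens = overlap_tokens
        self.tokenizer = tokenizer

    @property
    def needs_tokenizer(self) -> bool:
        return self.tokenizer is None

    def resolve_tokenizer(self, tokenizer: Tokenizer) -> None:
        # An explicitly configured tokenizer is never overwritten.
        if self.tokenizer is None:
            self.tokenizer = tokenizer

    def count_tokens(self, text: str) -> int:
        if self.tokenizer is None:
            raise RuntimeError("tokenizer has not been resolved yet")
        return len(self.tokenizer(text))


class SketchRecognizer:
    """Stand-in recognizer that fills in the chunker's tokenizer at load time."""

    def __init__(self, text_chunker: DeferredTokenChunker) -> None:
        self.text_chunker = text_chunker
        self.model_tokenizer: Optional[Tokenizer] = None

    def load(self) -> None:
        # A real recognizer would load the model here and take its tokenizer.
        self.model_tokenizer = whitespace_tokenizer
        if self.text_chunker.needs_tokenizer:
            self.text_chunker.resolve_tokenizer(self.model_tokenizer)


chunker = DeferredTokenChunker(max_tokens=512, overlap_tokens=32)
recognizer = SketchRecognizer(chunker)
recognizer.load()
```

Creating the chunker is cheap in this arrangement, because nothing is downloaded or loaded until the recognizer itself loads its model.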

Copilot AI review requested due to automatic review settings June 4, 2026 17:16
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comment on lines +150 to +156
model_config = ConfigDict(extra="forbid", arbitrary_types_allowed=True)

chunker_type: Literal["character", "tokenizer"] = Field(
    ..., description="Type of chunker"
)
chunk_size: Optional[int] = Field(None, description="Character chunk size")
chunk_overlap: Optional[int] = Field(None, description="Character overlap")
Contributor Author

Intentional - chunker_type is required to keep YAML configs explicit. If a user writes text_chunker: {}, that's likely a mistake and should be caught by validation rather than silently defaulting to character chunking. The recognizer already defaults to CharacterBasedTextChunker when text_chunker is omitted entirely.
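
The intended behavior can be illustrated with a stdlib-only stand-in for the Pydantic model. `validate_chunker_config` is a hypothetical helper that mimics a required chunker_type and extra="forbid"; the key names are the ones that appear in this PR's snippets and YAML examples.

```python
from typing import Any, Dict

ALLOWED_KEYS = {
    "chunker_type",
    "chunk_size",
    "chunk_overlap",
    "tokenizer",
    "max_tokens",
    "overlap_tokens",
}
ALLOWED_TYPES = {"character", "tokenizer"}


def validate_chunker_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Reject unknown keys and require an explicit, known chunker_type."""
    unknown = set(config) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unknown text_chunker keys: {sorted(unknown)}")
    if "chunker_type" not in config:
        raise ValueError("text_chunker.chunker_type is required")
    if config["chunker_type"] not in ALLOWED_TYPES:
        raise ValueError(f"unsupported chunker_type: {config['chunker_type']!r}")
    return dict(config)


def is_valid(config: Dict[str, Any]) -> bool:
    """Small helper for the examples below."""
    try:
        validate_chunker_config(config)
    except ValueError:
        return False
    return True


empty_ok = is_valid({})
typo_ok = is_valid({"chunker_type": "tokenizer", "max_token": 512})
good_ok = is_valid({"chunker_type": "tokenizer", "max_tokens": 512})
```

Here an empty mapping and a misspelled key both fail, while the well-formed config passes.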

Comment on lines +331 to +341
# 2. Convert text_chunker dict to BaseTextChunker instance
if "text_chunker" in kwargs and isinstance(kwargs["text_chunker"], dict):
    from presidio_analyzer.chunkers import TextChunkerProvider

    # Strip None values that may leak from Pydantic model_dump
    chunker_config = {
        k: v for k, v in kwargs["text_chunker"].items() if v is not None
    }
    kwargs["text_chunker"] = TextChunkerProvider(
        chunker_config
    ).create_chunker()
Contributor Author

@yuriihavrylko yuriihavrylko Jun 4, 2026

Not a concern in practice - text_chunker only appears in configs for recognizers that accept it. And with deferred tokenizer mode, no heavy work happens here (no model or tokenizer download); only a lightweight placeholder object is created.
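
As a side note on the None-stripping step in the snippet above, the small sketch below shows why it matters when optional fields are dumped as None: passing None explicitly would override a constructor default. `build_chunker` and `strip_none` are hypothetical helpers, and the 250/50 defaults simply mirror the values mentioned in the PR description.

```python
from typing import Any, Dict


def build_chunker(chunk_size: int = 250, chunk_overlap: int = 50) -> Dict[str, Any]:
    """Hypothetical factory that returns its effective settings for inspection."""
    return {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap}


def strip_none(config: Dict[str, Any]) -> Dict[str, Any]:
    """Drop keys whose value is None so constructor defaults still apply."""
    return {k: v for k, v in config.items() if v is not None}


# A dumped config in which the user set only chunk_size.
dumped = {"chunk_size": 400, "chunk_overlap": None}

unfiltered = build_chunker(**dumped)            # None overrides the default
filtered = build_chunker(**strip_none(dumped))  # the default of 50 survives
```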
