feat(analyzer): Deduplicate NER model instances in multilingual configuration to reduce memory usage #2052
…nizers to avoid in-memory duplicates
…lp engine, and recognizers
# Class-level cache for sharing GLiNER models across instances.
# Keyed by (model_name, map_location, load_onnx_model, onnx_model_file).
# Avoids loading duplicate copies when the same model serves multiple languages.
_shared_models: dict = {}
Can we think of an alternative approach using dependency injection or a model registry? As written, the cached model would never get released.
For example:
model = GLiNER.from_pretrained(...)
recognizer_en = GLiNERRecognizer(..., model=model)
recognizer_es = GLiNERRecognizer(..., model=model)
recognizer_fr = GLiNERRecognizer(..., model=model)
Or
self.gliner = GLiNERModelRegistry.get_model(...)
The user should be able to control this model registry (add, remove, update).
Good point - the initial approach is naive, but it showcases the reality of the issue and the potential gains in multilingual setups. My intent was to implement it for both programmatic and yaml config-based use cases.
Here are the options I'm considering based on DI and model registry ideas:
Option A: DI + loader-level sharing
Dependency injection as you showed, plus:
# YAML path:
# RecognizerListLoader detects same-model recognizers and
# injects the loaded model into subsequent instances
Simple, no new classes. But it couples the loader to each recognizer's internals (gliner_model= vs ner_pipeline=, different cache key shapes). Every new recognizer type would need loader changes.
Option B: ModelRegistry
# Programmatic — user controls the registry:
registry = ModelRegistry()
rec_en = GLiNERRecognizer(model_registry=registry, supported_language="en")
rec_es = GLiNERRecognizer(model_registry=registry, supported_language="es")
# First instance loads and registers, second reuses
# Or direct injection (no registry needed):
model = GLiNER.from_pretrained(...)
rec = GLiNERRecognizer(gliner_model=model, ...)
# YAML path — automatic:
# RecognizerListLoader creates a ModelRegistry and injects it
# into recognizers that accept `model_registry` parameter
The caching logic (key shape, what to store) stays inside each recognizer - the loader just provides the shared bucket and doesn't need model-specific knowledge.
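To make Option B concrete, here is a minimal sketch of how such a registry could behave. Everything in it is an assumption for illustration: the ModelRegistry methods (get_or_load, remove), the FakeModel stand-in for a heavy model, and SketchRecognizer are hypothetical names, not existing Presidio or GLiNER APIs.

```python
from typing import Any, Callable, Dict, Hashable


class ModelRegistry:
    """Hypothetical user-controlled store of loaded models, keyed by config."""

    def __init__(self) -> None:
        self._models: Dict[Hashable, Any] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        # Load only on the first request for this key; reuse afterwards.
        if key not in self._models:
            self._models[key] = loader()
        return self._models[key]

    def remove(self, key: Hashable) -> None:
        # Drop the registry's reference so the model can be garbage-collected
        # once no recognizer holds it.
        self._models.pop(key, None)


class FakeModel:
    """Stand-in for a heavy model; counts how often it is constructed."""

    loads = 0

    def __init__(self, name: str) -> None:
        FakeModel.loads += 1
        self.name = name


class SketchRecognizer:
    """Stand-in recognizer: the cache key shape stays inside the recognizer."""

    def __init__(self, model_name: str, supported_language: str,
                 model_registry: ModelRegistry) -> None:
        self.supported_language = supported_language
        key = ("sketch", model_name)  # recognizer-specific key shape
        self.model = model_registry.get_or_load(key, lambda: FakeModel(model_name))


registry = ModelRegistry()
rec_en = SketchRecognizer("some-model", "en", registry)
rec_es = SketchRecognizer("some-model", "es", registry)
print(rec_en.model is rec_es.model, FakeModel.loads)  # prints "True 1"
```

Because the registry is an ordinary object owned by the caller, it can be dropped or cleared to release models, which is what a class-level cache cannot offer.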
Option C: Multi-language recognizer
GLiNERRecognizer(supported_languages=["en", "es", "de", ...])
Eliminates the problem at the root - one instance, one model, no sharing needed. But EntityRecognizer is built around supported_language (singular). This is the cleanest long-term solution, but it is a major refactor and out of scope for a bug fix or small improvement.
Option D: Lazy loading
Remove self.load() from EntityRecognizer.__init__, load on first analyze() call instead. Makes DI/registry/sharing trivially easy since all instances are created cheaply and models configured afterward. But it's a base class behavior change that affects every recognizer.
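A rough sketch of what Option D could look like, assuming a simplified base class. LazyRecognizer, its model property, and the placeholder load() are hypothetical and do not mirror the real EntityRecognizer interface.

```python
from typing import List, Optional


class LazyRecognizer:
    """Hypothetical base class: __init__ is cheap, the model loads on first use."""

    def __init__(self, supported_language: str,
                 model: Optional[object] = None) -> None:
        self.supported_language = supported_language
        self._model = model  # may be injected before the first analyze() call

    def load(self) -> object:
        # Placeholder for an expensive load (e.g. reading model weights).
        return object()

    @property
    def model(self) -> object:
        # Load at most once, and only if nothing was injected.
        if self._model is None:
            self._model = self.load()
        return self._model

    def analyze(self, text: str) -> List[str]:
        _ = self.model  # triggers the load on the first call only
        return []  # a real recognizer would return detected entities


shared = object()  # stands in for one loaded model
rec_en = LazyRecognizer("en", model=shared)
rec_es = LazyRecognizer("es", model=shared)
rec_fr = LazyRecognizer("fr")  # nothing is loaded until analyze() runs
```

The trade-off is that the first analyze() call pays the load cost, and load errors surface at request time instead of at construction.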
I prefer option B for this PR - covers both use cases, keeps the loader generic, gives the user full control.
Options C and D are worth considering as longer-term architectural improvements.
What do you think?
Thanks @yuriihavrylko, great addition! I've added one comment on the design - let's discuss.
Change Description
Problem
When running Presidio Analyzer with a multilingual HuggingFace NER configuration (e.g., 8 languages), the RecognizerListLoader creates one recognizer instance per language. Each instance calls load() during __init__, loading a separate copy of the same model into memory. For a HuggingFace transformer model (~400MB), 8 languages means ~3.2GB of duplicated weights. The same issue affects GLiNER and spaCy multilingual models (xx_ent_wiki_sm).
Before fix: ~3.5GB memory (Docker, CPU, 8 languages with dslim/bert-base-NER)
After fix: ~1.1GB memory (same setup)
The same improvement applies to GPU deployments - model weights are the bottleneck regardless of device (RAM or VRAM).
Root causes
Language instances for the same xx_ent_wiki_sm model, each with independently growing StringStore/Vocab
Changes
Model sharing (memory fix)
HuggingFaceNerRecognizer: added class-level _shared_pipelines cache keyed by (model_name, tokenizer_name, aggregation_strategy, device). Instances with identical config reuse the same pipeline.
GLiNERRecognizer: added class-level _shared_models cache keyed by (model_name, map_location, load_onnx_model, onnx_model_file). Same pattern.
SpacyNlpEngine.load(): uses a local loaded_models dict to share a single spacy.Language instance when multiple languages use the same model_name.
Tests
Thread safety note
The class-level caches are safe for multiprocessing deployments (e.g., gunicorn sync workers, which is the default). For multithreaded deployments, a lock would be needed - but Presidio's standard deployment pattern uses multiprocessing.
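For reference, a lock-guarded variant of the class-level cache might look like the sketch below. It is not the code in this PR: CachedRecognizer, the two-field cache key, and the object() placeholder for the model load are assumptions made to keep the example self-contained.

```python
import threading
from typing import Any, Dict, Tuple


class CachedRecognizer:
    """Stand-in recognizer with a lock-guarded class-level model cache."""

    _shared_models: Dict[Tuple[str, str], Any] = {}
    _shared_models_lock = threading.Lock()
    load_count = 0  # for illustration only

    def __init__(self, model_name: str, device: str,
                 supported_language: str) -> None:
        self.supported_language = supported_language
        self.model = self._get_shared_model(model_name, device)

    @classmethod
    def _get_shared_model(cls, model_name: str, device: str) -> Any:
        key = (model_name, device)
        # Holding the lock across check-and-load ensures only one thread
        # loads a given model; the others wait and then reuse it.
        with cls._shared_models_lock:
            if key not in cls._shared_models:
                cls.load_count += 1
                cls._shared_models[key] = object()  # placeholder for a real load
            return cls._shared_models[key]


def build(lang: str, out: Dict[str, CachedRecognizer]) -> None:
    out[lang] = CachedRecognizer("some-model", "cpu", lang)


recognizers: Dict[str, CachedRecognizer] = {}
threads = [threading.Thread(target=build, args=(lang, recognizers))
           for lang in ("en", "es", "de", "fr")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single lock also serializes loads of different models; per-key locks would avoid that at the cost of more bookkeeping.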
Issue reference
No linked issue
Checklist