
Support multiple GLiNER YAML configurations #2018

Open
ynachiket wants to merge 9 commits into microsoft:main from ynachiket:ynachiket/gliner-yaml-multiple-models-20260503

Conversation

@ynachiket
Contributor

Change Description

Support configuring multiple GLiNERRecognizer instances in YAML by allowing class_name: GLiNERRecognizer entries without explicit name values and deriving deterministic recognizer names from model_name when name is omitted.

This preserves explicit name values unchanged and keeps the legacy GLiNERRecognizer name for the built-in default model.

Issue reference

Addresses #1760.

Validation

Ran from presidio-analyzer in a local Python 3.12 venv:

pytest tests/test_gliner_recognizer.py -q
pytest tests/test_recognizers_loader_utils.py -q
pytest tests/test_yaml_recognizer_models.py -q
pytest tests/test_recognizer_registry_provider.py -q
git diff --check

Results:

  • tests/test_gliner_recognizer.py: 3 passed, 12 skipped
  • tests/test_recognizers_loader_utils.py: 11 passed
  • tests/test_yaml_recognizer_models.py: 47 passed
  • tests/test_recognizer_registry_provider.py: 17 passed
  • git diff --check: passed

Also attempted the full analyzer test suite; unrelated optional dependency tests fail locally without Azure/LangExtract extras installed, while the relevant targeted suites pass.

Checklist

  • I have reviewed the contribution guidelines
  • My code includes unit tests
  • Targeted unit tests and diff checks pass locally
  • Full test suite passes locally

@omri374
Collaborator

omri374 commented May 7, 2026

Looks like a duplicate of #2007, am I correct?

Contributor

Copilot AI left a comment

Pull request overview

This PR enables configuring multiple GLiNERRecognizer instances via YAML by allowing class_name: GLiNERRecognizer entries without explicit name, and by deriving deterministic instance names from model_name when name is omitted (while preserving the legacy GLiNERRecognizer name for the built-in default model).

Changes:

  • Update GLiNERRecognizer to accept name=None and derive a stable default name from model_name.
  • Relax YAML predefined recognizer validation to allow missing name when class_name is present, and improve error descriptions.
  • Add unit tests covering name derivation, registry loading with multiple GLiNER entries, and validation/error-message behavior.
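
The relaxed validation rule in the second bullet can be pictured with a small stand-in. This is only a sketch built on a plain dataclass: `PredefinedEntry` and `validate_entry` are hypothetical names, the model ids are made up, and the real change lives in the Pydantic models in yaml_recognizer_models.py, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredefinedEntry:
    """Hypothetical stand-in for one predefined recognizer entry from YAML."""

    name: Optional[str] = None
    class_name: Optional[str] = None
    model_name: Optional[str] = None


def validate_entry(entry: PredefinedEntry) -> PredefinedEntry:
    """Accept an entry that carries a name, a class_name, or both."""
    if not entry.name and not entry.class_name:
        raise ValueError("Predefined recognizer needs 'name' or 'class_name'")
    return entry


# Two nameless GLiNER entries pass validation (model ids are illustrative).
entries = [
    validate_entry(
        PredefinedEntry(class_name="GLiNERRecognizer", model_name="acme/model-a")
    ),
    validate_entry(
        PredefinedEntry(class_name="GLiNERRecognizer", model_name="acme/model-b")
    ),
]
```

An entry with neither field set is the only case this sketch rejects; naming the resulting instances is left to the recognizer itself.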

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py | Derive deterministic recognizer names from model_name when name is omitted; preserve legacy naming for default model. |
| presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py | Allow predefined configs to omit name when class_name is provided; add GLiNER-specific config model and improved error descriptions. |
| presidio-analyzer/tests/test_gliner_recognizer.py | Add tests for name derivation and legacy-name preservation when name is omitted. |
| presidio-analyzer/tests/test_recognizers_loader_utils.py | Add integration-style test validating two nameless GLiNER YAML entries produce distinct recognizer instances. |
| presidio-analyzer/tests/test_yaml_recognizer_models.py | Add tests for error-description fallbacks and validation requiring name or class_name. |

Comments suppressed due to low confidence (1)

presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py:62

  • _DEFAULT_GLINER_MODEL_NAME is introduced but the model_name default in GLiNERRecognizer.__init__ is still a duplicated string literal. Using the constant for the default parameter would prevent accidental drift between the default model used and the value used for legacy-name preservation.
_DEFAULT_GLINER_MODEL_NAME = "urchade/gliner_multi_pii-v1"
_LEGACY_GLINER_RECOGNIZER_NAME = "GLiNERRecognizer"


def _sanitize_model_name_for_instance_name(model_name: str) -> str:
    """Map a HF-style model id to a deterministic single-token-ish suffix."""

    sanitized = re.sub(r"[^0-9A-Za-z]+", "_", model_name)
    sanitized = re.sub(r"_+", "_", sanitized).strip("_")
    return sanitized


def _default_gliner_recognizer_name(model_name: str) -> str:
    """Stable default recognizer ``name`` when the user omits ``name``.

    Preserve the legacy name for the built-in default model for backwards compatibility.
    """

    if model_name == _DEFAULT_GLINER_MODEL_NAME:
        return _LEGACY_GLINER_RECOGNIZER_NAME
    suffix = _sanitize_model_name_for_instance_name(model_name)
    return f"{_LEGACY_GLINER_RECOGNIZER_NAME}_{suffix}"


class GLiNERRecognizer(LocalRecognizer):
    """GLiNER model based entity recognizer."""

    def __init__(
        self,
        supported_entities: Optional[List[str]] = None,
        name: Optional[str] = None,
        supported_language: str = "en",
        version: str = "0.0.1",
        context: Optional[List[str]] = None,
        entity_mapping: Optional[Dict[str, str]] = None,
        model_name: str = "urchade/gliner_multi_pii-v1",
        flat_ner: bool = True,

Comment on lines +30 to +47
def _sanitize_model_name_for_instance_name(model_name: str) -> str:
    """Map a HF-style model id to a deterministic single-token-ish suffix."""

    sanitized = re.sub(r"[^0-9A-Za-z]+", "_", model_name)
    sanitized = re.sub(r"_+", "_", sanitized).strip("_")
    return sanitized


def _default_gliner_recognizer_name(model_name: str) -> str:
    """Stable default recognizer ``name`` when the user omits ``name``.

    Preserve the legacy name for the built-in default model for backwards compatibility.
    """

    if model_name == _DEFAULT_GLINER_MODEL_NAME:
        return _LEGACY_GLINER_RECOGNIZER_NAME
    suffix = _sanitize_model_name_for_instance_name(model_name)
    return f"{_LEGACY_GLINER_RECOGNIZER_NAME}_{suffix}"
@ynachiket
Contributor Author

Addressed the Copilot review suggestion in 6c3debc: the GLiNERRecognizer.__init__ default now reuses _DEFAULT_GLINER_MODEL_NAME, and I added a small regression test to guard against drift between the constructor default and the legacy-name comparison.

Validation:

  • git diff --check
  • python -m py_compile presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py presidio-analyzer/tests/test_gliner_recognizer.py
  • PYTHONPATH=presidio-analyzer python -m pytest presidio-analyzer/tests/test_gliner_recognizer.py -q → 4 passed, 12 skipped
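
A drift guard of this kind can be written by reading the default straight from the constructor signature. The following is a minimal sketch, not the test added in 6c3debc: `FakeGLiNERRecognizer` is a stand-in so the snippet runs without the GLiNER dependency, and `constructor_default` is a hypothetical helper.

```python
import inspect

_DEFAULT_GLINER_MODEL_NAME = "urchade/gliner_multi_pii-v1"


class FakeGLiNERRecognizer:
    """Stand-in with the same default-argument pattern; not the real class."""

    def __init__(self, name=None, model_name: str = _DEFAULT_GLINER_MODEL_NAME):
        self.name = name
        self.model_name = model_name


def constructor_default(cls, parameter: str):
    """Read a parameter's default value from the constructor signature."""
    return inspect.signature(cls.__init__).parameters[parameter].default


def test_model_name_default_matches_constant():
    # Fails if someone reintroduces a separate string literal that drifts.
    assert (
        constructor_default(FakeGLiNERRecognizer, "model_name")
        == _DEFAULT_GLINER_MODEL_NAME
    )
```

Checking the signature rather than instantiating the class keeps the test independent of model loading.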

@omri374
Collaborator

omri374 commented May 14, 2026

Hi, this PR partially overlaps with the merged #2007. I merged this branch with main, so let's take it from there.

return super().model_dump(*args, **kwargs)


class GLiNERRecognizerConfig(PredefinedRecognizerConfig):
Collaborator

This is now a duplicate of line 210

Copilot AI review requested due to automatic review settings June 1, 2026 05:40
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment on lines +248 to +268
class GLiNERRecognizerConfig(PredefinedRecognizerConfig):
    """Configuration specifically for GLiNER recognizers."""

    model_config = ConfigDict(extra="allow")

    model_name: Optional[str] = Field(None, description="GLiNER model name")
    entity_mapping: Optional[Dict[str, str]] = Field(
        None, description="GLiNER label to Presidio entity mapping"
    )
    flat_ner: Optional[bool] = Field(None, description="Whether to use flat NER")
    multi_label: Optional[bool] = Field(
        None, description="Whether to use multi-label classification"
    )
    threshold: Optional[float] = Field(None, description="Confidence threshold")
    map_location: Optional[str] = Field(None, description="Model device")
    load_onnx_model: Optional[bool] = Field(
        None, description="Whether to load GLiNER with ONNX Runtime"
    )
    onnx_model_file: Optional[str] = Field(None, description="ONNX model file name")


@ynachiket
Contributor Author

Addressed the latest review feedback in e189798:

  • Removed the duplicate GLiNERRecognizerConfig definition so the existing model_dump(exclude_none=True) behavior and mutually-exclusive entity_mapping / supported_entities validator stay active.
  • Made generated GLiNER recognizer names collision-resistant by appending a short stable hash of the original model_name, with a non-empty fallback for punctuation-only model IDs.
  • Added regression coverage for sanitized-name collisions and empty sanitized suffixes.
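
The collision-resistant naming in the second bullet could look roughly like the sketch below. The commit itself is not quoted in this thread, so several details are assumptions: the helper name `collision_resistant_name`, SHA-256 as the hash, the 8-character digest length, and the `"model"` fallback suffix.

```python
import hashlib
import re

_DEFAULT_GLINER_MODEL_NAME = "urchade/gliner_multi_pii-v1"
_LEGACY_GLINER_RECOGNIZER_NAME = "GLiNERRecognizer"


def collision_resistant_name(model_name: str, hash_len: int = 8) -> str:
    """Hypothetical helper: sanitized suffix plus a short hash of the raw id."""
    if model_name == _DEFAULT_GLINER_MODEL_NAME:
        return _LEGACY_GLINER_RECOGNIZER_NAME
    sanitized = re.sub(r"[^0-9A-Za-z]+", "_", model_name).strip("_")
    # Assumed fallback for ids that sanitize to an empty string.
    suffix = sanitized or "model"
    # hashlib digests are stable across processes, unlike built-in hash() on str.
    digest = hashlib.sha256(model_name.encode("utf-8")).hexdigest()[:hash_len]
    return f"{_LEGACY_GLINER_RECOGNIZER_NAME}_{suffix}_{digest}"


# Ids that sanitize to the same suffix are told apart by the hash.
a = collision_resistant_name("acme/pii-model.v2")
b = collision_resistant_name("acme/pii_model-v2")
assert a != b

# A punctuation-only id still yields a non-empty suffix.
assert collision_resistant_name("--/--").startswith("GLiNERRecognizer_model_")
```

Hashing the original model_name rather than the sanitized form is what separates the two ids, since sanitization is the step that erases their difference.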

Validation:

  • git diff --check
  • /tmp/presidio-pdlc044-venv/bin/python -m py_compile presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/gliner_recognizer.py presidio-analyzer/tests/test_gliner_recognizer.py presidio-analyzer/tests/test_yaml_recognizer_models.py
  • PYTHONPATH=presidio-analyzer /tmp/presidio-pdlc044-venv/bin/python -m pytest presidio-analyzer/tests/test_yaml_recognizer_models.py presidio-analyzer/tests/test_gliner_recognizer.py -q -> 63 passed, 12 skipped
