5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.

## [unreleased]

### General
#### Changed
- Documented how to build custom `presidio-analyzer` Docker images with alternative analyzer, NLP, and recognizer-registry YAML files, including multilingual configuration checks and common startup warnings. Fixes [#1663](https://github.com/microsoft/presidio/issues/1663).

### Anonymizer
#### Fixed
- Custom operator `validate()` no longer calls the user-supplied lambda with a dummy `"PII"` value. Previously, stateful lambdas (e.g. those accumulating a token-to-original-value map for de-anonymization) would receive a spurious invocation during validation, inserting a junk entry (`{"TOKEN_1": "PII"}`) into the map and skewing all subsequent token counters. The return-type contract is now enforced in `operate()` when the lambda runs on real data. Fixes [#2024](https://github.com/microsoft/presidio/issues/2024).
@@ -809,4 +813,3 @@ New endpoint for deanonymizing encrypted entities by the anonymizer.
[2.2.23]: https://github.com/microsoft/presidio/compare/2.2.2...2.2.23
[2.2.2]: https://github.com/microsoft/presidio/compare/2.2.1...2.2.2
[2.2.1]: https://github.com/microsoft/presidio/compare/2.2.0...2.2.1

80 changes: 70 additions & 10 deletions docs/installation.md
@@ -96,10 +96,10 @@ To download the Presidio Docker containers, run the following command:

This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).

### For PII anonymization in text

For PII detection and anonymization in text, the `presidio-analyzer`
and `presidio-anonymizer` modules are required.

```sh
# Download Docker images
@@ -109,12 +109,72 @@ docker pull mcr.microsoft.com/presidio-anonymizer
# Run containers with default ports
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest

docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
```

#### Building a custom `presidio-analyzer` image

The published analyzer image ships with the default English-only YAML files:

- `presidio_analyzer/conf/default_analyzer.yaml`
- `presidio_analyzer/conf/default.yaml`
- `presidio_analyzer/conf/default_recognizers.yaml`

To add more languages or enable a different recognizer mix, create custom copies
of those files inside the `presidio-analyzer/` build context and point the Docker
build to them with build arguments.

For example:

```sh
cp presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml \
presidio-analyzer/presidio_analyzer/conf/custom_analyzer.yaml
cp presidio-analyzer/presidio_analyzer/conf/default.yaml \
presidio-analyzer/presidio_analyzer/conf/custom_nlp.yaml
cp presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml \
presidio-analyzer/presidio_analyzer/conf/custom_recognizers.yaml
```

Then update the copies as needed:

- Add every language the analyzer should handle at runtime to `supported_languages` in both
`custom_analyzer.yaml` and `custom_recognizers.yaml`.
- Configure the NLP models for those languages in `custom_nlp.yaml`.
- Enable or add recognizers that should run for the new languages.
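
As a rough illustration of the NLP-model bullet, the sketch below prints a minimal two-language configuration in the `nlp_engine_name` / `models` layout with `lang_code` and `model_name` pairs. The helper name `example_nlp_conf` and the choice of Spanish with `es_core_news_md` are assumptions made for illustration only. The copied `default.yaml` may carry additional keys that this fragment omits, so treat the output as something to merge by hand rather than a drop-in replacement.

```sh
# Hypothetical helper: prints an illustrative NLP configuration fragment.
# The Spanish entry is only an example; substitute the languages and models you need.
example_nlp_conf() {
  cat <<'EOF'
nlp_engine_name: spacy
models:
  - lang_code: en
    model_name: en_core_web_lg
  - lang_code: es
    model_name: es_core_news_md
EOF
}
```

Calling `example_nlp_conf` writes the fragment to standard output, which makes it easy to compare against the `models` section of `custom_nlp.yaml`.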

Build the image with the custom file paths:

```sh
docker build ./presidio-analyzer \
-t presidio/presidio-analyzer-custom \
--build-arg ANALYZER_CONF_FILE=presidio_analyzer/conf/custom_analyzer.yaml \
--build-arg NLP_CONF_FILE=presidio_analyzer/conf/custom_nlp.yaml \
--build-arg RECOGNIZER_REGISTRY_CONF_FILE=presidio_analyzer/conf/custom_recognizers.yaml
```

Run it the same way as the default image:

```sh
docker run -d -p 5002:3000 presidio/presidio-analyzer-custom
```
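
To check that the custom image actually serves an added language, one option is to send a request to the analyzer's `/analyze` endpoint on the port mapped above. The following is a minimal sketch: the helper names `analyze_payload` and `analyze_text`, the sample sentence, and the `es` language code are illustrative assumptions, and the payload builder does no JSON escaping, so it only suits simple text without quotes or backslashes.

```sh
# Hypothetical helpers for a quick smoke test; they are not part of Presidio.

# Build the JSON body for /analyze (no escaping: keep the text free of quotes).
analyze_payload() {
  printf '{"text": "%s", "language": "%s"}' "$1" "$2"
}

# POST the body to the analyzer container published on localhost:5002.
analyze_text() {
  curl -s -X POST "http://localhost:5002/analyze" \
    -H "Content-Type: application/json" \
    -d "$(analyze_payload "$1" "$2")"
}
```

For example, `analyze_text "Mi nombre es Juan" es` should return a JSON response from the analyzer if Spanish was configured in all three custom files; an error response is a hint to revisit the configuration.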

!!! note "Important configuration checks"

    - `supported_languages` must match between the analyzer and recognizer-registry YAML files.
    - The Docker build installs the NLP models declared in `NLP_CONF_FILE` (or in the analyzer file if you use a single combined config), so adding more or larger models increases build time and image size.
    - If the container logs warnings such as `NLP recognizer ... is not in the list of recognizers for language ...`, check that the recognizer registry still includes the NLP recognizer for that language (for example `SpacyRecognizer`) and that the same language code appears in both YAML files.
    - If you are experimenting with many languages at once, start with a smaller subset first. Downloading and loading several large NLP models in one image can significantly increase memory usage during build and startup.
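
The first check can be approximated before building the image. The sketch below compares the top-level `supported_languages` lists of two files; it is not a YAML parser and assumes the key starts at column 0 and is followed by a block list (`- en`, one entry per line, unquoted and without inline comments). The function names `top_level_languages` and `check_languages_match` are hypothetical.

```sh
# Print the top-level supported_languages entries of a YAML file, sorted.
# Assumes a block list directly under an unindented `supported_languages:` key.
top_level_languages() {
  awk '
    /^supported_languages:/ { in_block = 1; next }   # top-level key only
    in_block && /^[ \t]*-[ \t]*/ {                   # list item inside the block
      sub(/^[ \t]*-[ \t]*/, "")
      sub(/[ \t]+$/, "")
      print
      next
    }
    in_block { in_block = 0 }                        # any other line ends the block
  ' "$1" | sort
}

# Return 0 when both files declare the same set of languages, 1 otherwise.
check_languages_match() {
  a=$(top_level_languages "$1")
  b=$(top_level_languages "$2")
  if [ "$a" = "$b" ]; then
    echo "supported_languages match"
  else
    echo "supported_languages differ between $1 and $2" >&2
    return 1
  fi
}
```

With the file names used earlier, the call would be `check_languages_match presidio-analyzer/presidio_analyzer/conf/custom_analyzer.yaml presidio-analyzer/presidio_analyzer/conf/custom_recognizers.yaml`. Per-recognizer `supported_languages` entries are indented and therefore ignored by this check.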

For more background on the YAML structure, see:

- [Analyzer Engine Provider](analyzer/analyzer_engine_provider.md)
- [PII detection in different languages](analyzer/languages.md)
- [Customizing NLP models](analyzer/customizing_nlp_models.md)
- [Recognizer registry configuration](analyzer/recognizer_registry_provider.md)

### For PII redaction in images

For PII detection in images, the `presidio-image-redactor` is required.

```sh
# Download Docker image