5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -4,6 +4,10 @@ All notable changes to this project will be documented in this file.

## [unreleased]

### General
#### Changed
- Documented how to build custom `presidio-analyzer` Docker images with alternative analyzer, NLP, and recognizer-registry YAML files, including multilingual configuration checks and common startup warnings. Fixes [#1663](https://github.com/microsoft/presidio/issues/1663).

### Anonymizer
#### Fixed
- Custom operator `validate()` no longer calls the user-supplied lambda with a dummy `"PII"` value. Previously, stateful lambdas (e.g. those accumulating a token-to-original-value map for de-anonymization) would receive a spurious invocation during validation, inserting a junk entry (`{"TOKEN_1": "PII"}`) into the map and skewing all subsequent token counters. The return-type contract is now enforced in `operate()` when the lambda runs on real data. Fixes [#2024](https://github.com/microsoft/presidio/issues/2024).
@@ -809,4 +813,3 @@ New endpoint for deanonymizing encrypted entities by the anonymizer.
[2.2.23]: https://github.com/microsoft/presidio/compare/2.2.2...2.2.23
[2.2.2]: https://github.com/microsoft/presidio/compare/2.2.1...2.2.2
[2.2.1]: https://github.com/microsoft/presidio/compare/2.2.0...2.2.1

83 changes: 73 additions & 10 deletions docs/installation.md
@@ -96,10 +96,10 @@ To download the Presidio Docker containers, run the following command:

This requires Docker to be installed. [Download Docker](https://docs.docker.com/get-docker/).

### For PII anonymization in text

For PII detection and anonymization in text, the `presidio-analyzer`
and `presidio-anonymizer` modules are required.

```sh
# Download Docker images
@@ -109,12 +109,75 @@ docker pull mcr.microsoft.com/presidio-anonymizer
# Run containers with default ports
docker run -d -p 5002:3000 mcr.microsoft.com/presidio-analyzer:latest

docker run -d -p 5001:3000 mcr.microsoft.com/presidio-anonymizer:latest
```

#### Building a custom `presidio-analyzer` image

The published analyzer image ships with the default English-only YAML files:

- `presidio_analyzer/conf/default_analyzer.yaml`
- `presidio_analyzer/conf/default.yaml`
- `presidio_analyzer/conf/default_recognizers.yaml`

To add more languages or enable a different set of recognizers, clone this
repository, create custom copies of those files inside the `presidio-analyzer/`
build context, and point the Docker build to them with build arguments. Run the
commands in this section from the repository root.

For example:

```sh
cd /path/to/presidio

cp presidio-analyzer/presidio_analyzer/conf/default_analyzer.yaml \
presidio-analyzer/presidio_analyzer/conf/custom_analyzer.yaml
cp presidio-analyzer/presidio_analyzer/conf/default.yaml \
presidio-analyzer/presidio_analyzer/conf/custom_nlp.yaml
cp presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml \
presidio-analyzer/presidio_analyzer/conf/custom_recognizers.yaml
```

Then update the copies as needed:

- Add every runtime language to `supported_languages` in both
`custom_analyzer.yaml` and `custom_recognizers.yaml`.
- Configure the NLP models for those languages in `custom_nlp.yaml`.
- Enable or add recognizers that should run for the new languages.
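
Before building, it can help to confirm that the two files list the same languages. The following is a minimal sketch, not part of Presidio: the helper names `list_languages` and `check_languages_match` are hypothetical, and the script assumes `supported_languages` is a top-level block list with unquoted entries (for example `- en`), as in the default files.

```sh
# Hypothetical helper: print the entries of the top-level
# `supported_languages` block list, one per line, sorted.
# Assumes unquoted entries written as "  - en".
list_languages() {
  awk '
    /^supported_languages:/ { in_list = 1; next }
    in_list && /^[ \t]*$/ { next }            # skip blank lines in the list
    in_list && /^[ \t]*#/ { next }            # skip comment lines in the list
    in_list && /^[ \t]*-[ \t]*/ {
      sub(/^[ \t]*-[ \t]*/, "")               # drop the list dash
      sub(/[ \t]*(#.*)?$/, "")                # drop trailing comment and spaces
      print
      next
    }
    in_list { in_list = 0 }                   # any other key ends the list
  ' "$1" | sort
}

# Hypothetical helper: succeed only if both files declare the same,
# non-empty set of languages.
check_languages_match() {
  analyzer_langs=$(list_languages "$1")
  registry_langs=$(list_languages "$2")
  if [ -n "$analyzer_langs" ] && [ "$analyzer_langs" = "$registry_langs" ]; then
    echo "OK: supported_languages match"
  else
    echo "MISMATCH: supported_languages differ between $1 and $2" >&2
    return 1
  fi
}
```

With the functions defined, `check_languages_match presidio-analyzer/presidio_analyzer/conf/custom_analyzer.yaml presidio-analyzer/presidio_analyzer/conf/custom_recognizers.yaml` returns a non-zero status when the lists differ. A flow-style list (`[en, es]`) or quoted entries would need a real YAML parser instead.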

Build the image with the custom file paths:

```sh
docker build ./presidio-analyzer \
-t presidio-analyzer-custom \
--build-arg ANALYZER_CONF_FILE=presidio_analyzer/conf/custom_analyzer.yaml \
--build-arg NLP_CONF_FILE=presidio_analyzer/conf/custom_nlp.yaml \
--build-arg RECOGNIZER_REGISTRY_CONF_FILE=presidio_analyzer/conf/custom_recognizers.yaml
```

Run it the same way as the default image:

```sh
docker run -d -p 5002:3000 presidio-analyzer-custom
```
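
Once the container is running, a quick request per configured language shows whether the custom configuration was picked up. This sketch assumes the analyzer's `/health`, `/supportedentities`, and `/analyze` HTTP endpoints and the port mapping used above; the function name `smoke_test_analyzer` and the sample inputs are illustrative only.

```sh
# Hypothetical helper: smoke-test a running analyzer container.
# $1: base URL, $2: language code, $3: sample text (must not contain
# double quotes or backslashes, since it is pasted into JSON as-is).
smoke_test_analyzer() {
  base_url=$1
  language=$2
  text=$3

  # Liveness check.
  curl -fsS "$base_url/health" || return 1
  echo

  # Entities the analyzer can detect for this language.
  curl -fsS "$base_url/supportedentities?language=$language" || return 1
  echo

  # Analyze a sample sentence in that language.
  curl -fsS -X POST "$base_url/analyze" \
    -H "Content-Type: application/json" \
    --data "{\"text\": \"$text\", \"language\": \"$language\"}"
}
```

For example, `smoke_test_analyzer http://localhost:5002 es "Mi nombre es David"` exercises a Spanish configuration. A request for a language that was not configured should fail, which `curl -f` reports as a non-zero exit status.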

!!! note "Important configuration checks"

    - `supported_languages` must match between the analyzer and recognizer-registry YAML files.
    - The Docker build installs the NLP models declared in `NLP_CONF_FILE` (or in the analyzer file if you use a single combined config), so adding more or larger models increases build time and image size.
    - If the container logs warnings such as `NLP recognizer ... is not in the list of recognizers for language ...`, check that the recognizer registry still includes the NLP recognizer for that language (for example `SpacyRecognizer`) and that the same language code appears in both YAML files.
    - If you are experimenting with many languages at once, start with a smaller subset first. Downloading and loading several large NLP models in one image can significantly increase memory usage during build and startup.
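
To look for the startup warning mentioned in the checks, the container logs can be filtered for the fixed part of that message. This is a rough sketch: `count_nlp_recognizer_warnings` is a hypothetical helper, and the search string is taken from the warning text quoted in this section, so it may need adjusting if the wording differs in your version.

```sh
# Hypothetical helper: count log lines that report a missing NLP recognizer.
# Reads log text on stdin and prints the number of matching lines.
count_nlp_recognizer_warnings() {
  # grep -c exits with status 1 when nothing matches; keep the status at 0
  # so that a clean log is not treated as an error.
  grep -c "is not in the list of recognizers for language" || true
}
```

Typical use is `docker logs <container-id> 2>&1 | count_nlp_recognizer_warnings`, where `<container-id>` is the ID printed by `docker run -d`. A result of `0` means the warning did not appear.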

For more background on the YAML structure, see:

- [Analyzer Engine Provider](analyzer/analyzer_engine_provider.md)
- [PII detection in different languages](analyzer/languages.md)
- [Customizing NLP models](analyzer/customizing_nlp_models.md)
- [Recognizer registry configuration](analyzer/recognizer_registry_provider.md)

### For PII redaction in images

For PII detection in images, the `presidio-image-redactor` is required.

```sh
# Download Docker image