195 changes: 195 additions & 0 deletions docs/pulid.md
@@ -0,0 +1,195 @@
# PuLID-Flux face-identity preservation

stable-diffusion.cpp supports the [PuLID-Flux](https://github.com/ToTheBeginning/PuLID)
identity-injection technique on top of Flux.1 (schnell or dev) models.
Given a single source portrait, PuLID-Flux produces new generations that
preserve the source person's face across arbitrary scenes, poses, and
prompts.

Unlike PhotoMaker (which extracts the identity inside the inference
process from a directory of images), PuLID-Flux's identity extractor is
a heavy stack (insightface ArcFace + EVA-CLIP-L + IDFormer encoder) that
is impractical to port to C++/ggml. To keep this implementation small and
cross-vendor, **stable-diffusion.cpp consumes a precomputed identity
embedding** produced by an external Python tool that runs once per source
portrait. Everything downstream of that one-shot extraction is C++ on top of
ggml and is expected to run on any backend (Vulkan, CUDA, Metal, ROCm, CPU);
only Vulkan and CPU have been tested so far.

## Architecture summary

The PuLID-Flux contribution to the Flux denoise loop is a stack of 20
small cross-attention modules (`PerceiverAttentionCA`) inserted between
the Flux transformer blocks:

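- After each double-stream block whose zero-based index is a multiple of 2
  (blocks 0, 2, ..., 18 of 19: 10 hook points)
- After each single-stream block whose zero-based index is a multiple of 4
  (blocks 0, 4, ..., 36 of 38: 10 hook points)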

Each cross-attention layer takes the current image tokens as query, the
32-token / 2048-dim identity embedding as key+value, and adds its output
(scaled by `id_weight`, typically 1.0) back to the image tokens.
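
The hook schedule can be reproduced with a few lines of Python. This is a minimal sketch, not code from this PR: it assumes a hook fires after every block whose zero-based index is a multiple of the interval, which is the only schedule that yields the 10 + 10 hook points stated above. The names `hook_points` and `ca_index` are hypothetical.

```python
# Assumption: a hook fires after every block whose zero-based index is a
# multiple of the interval. This matches the 10 + 10 counts in the text.
DOUBLE_BLOCKS, DOUBLE_INTERVAL = 19, 2
SINGLE_BLOCKS, SINGLE_INTERVAL = 38, 4


def hook_points(num_blocks: int, interval: int) -> list[int]:
    """Return the zero-based block indices that are followed by a PuLID hook."""
    return [i for i in range(num_blocks) if i % interval == 0]


double_hooks = hook_points(DOUBLE_BLOCKS, DOUBLE_INTERVAL)
single_hooks = hook_points(SINGLE_BLOCKS, SINGLE_INTERVAL)

# Flux runs all double-stream blocks before the single-stream blocks, so the
# cross-attention modules are consumed in that order.
ca_index = {("double", b): n for n, b in enumerate(double_hooks)}
ca_index.update(
    {("single", b): len(double_hooks) + n for n, b in enumerate(single_hooks)}
)

print(double_hooks)   # block indices 0, 2, ..., 18
print(single_hooks)   # block indices 0, 4, ..., 36
print(len(ca_index))  # total number of cross-attention modules used
```

Under this assumption the last double-stream hook sits after block 18 and the last single-stream hook after block 36, for 20 modules in total.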

## Required weights

PuLID-Flux needs three sets of weights: the standard Flux set plus two
PuLID-specific files.

1. **Flux base** (transformer + VAE + clip_l + t5xxl) -- exactly as
   [docs/flux.md](flux.md) describes.
2. **PuLID weights** -- download from
   [guozinan/PuLID](https://huggingface.co/guozinan/PuLID):
   - `pulid_flux_v0.9.0.safetensors` or `pulid_flux_v0.9.1.safetensors`
     (v0.9.1 recommended; this implementation is verified against it)
   - **v1.1 (`pulid_v1.1.safetensors`) is NOT yet supported** -- it uses
     renamed keys (`id_adapter_attn_layers.*` instead of `pulid_ca.*`)
     and possibly a different module structure. Support is left to a
     future PR.
3. **Identity embedding (.pulidembd)** -- produced by the precompute
   tool below.

## Precompute the identity embedding

The precompute tool runs the PyTorch identity-extraction stack on a
single portrait image and writes the resulting `(32, 2048)` embedding
to a `.pulidembd` binary file (about 131 KB when stored as fp16). Run it
once per source person; the same file is reused for any number of
generations.

A reference Python script is provided alongside this docs file at
[`scripts/pulid_extract_id.py`](../scripts/pulid_extract_id.py). It
requires:
- A working CUDA / CPU PyTorch + diffusers stack
- `insightface`, `facexlib`, `eva-clip`, `torchvision`
- The PuLID weights file (same one stable-diffusion.cpp will load below)
- The ToTheBeginning/PuLID repo's `pulid/pipeline_flux.py` (and its
dependencies under `pulid/` and `flux/`) -- recommended to vendor
rather than pip-install due to upstream packaging quirks

Run it as:

```
python pulid_extract_id.py \
--portrait /path/to/source-photo.jpg \
--pulid-weights /path/to/pulid_flux_v0.9.1.safetensors \
--out /path/to/source.pulidembd
```

## Binary format (.pulidembd)

```
offset 0 : magic "PULIDV01" (8 bytes ASCII)
offset 8 : num_tokens (uint32 LE) typically 32
offset 12 : token_dim (uint32 LE) typically 2048
offset 16 : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
offset 17 : reserved zeros (15 bytes; header total = 32)
offset 32 : tokens, row-major LE (num_tokens * token_dim values)
```

stable-diffusion.cpp parses the header, validates the magic, and converts
the tokens to fp32 at load time. For the typical (32, 2048, fp16) case the
file is 131,104 bytes: a 32-byte header plus 32 * 2048 * 2 = 131,072 bytes
of payload, or about 131 KB.
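
The layout is small enough to exercise with the Python standard library. The following is a minimal sketch, not the code used by stable-diffusion.cpp or by `pulid_extract_id.py`: the function names `pack_pulidembd` and `unpack_pulidembd`, and the strict payload-size check, are assumptions made for illustration.

```python
import struct

MAGIC = b"PULIDV01"
HEADER_SIZE = 32
DTYPE_FP16, DTYPE_BF16, DTYPE_FP32 = 0, 1, 2
BYTES_PER_VALUE = {DTYPE_FP16: 2, DTYPE_BF16: 2, DTYPE_FP32: 4}


def pack_pulidembd(tokens: list[list[float]], dtype: int = DTYPE_FP16) -> bytes:
    """Serialize a row-major token matrix (this sketch writes fp16 or fp32 only)."""
    if dtype not in (DTYPE_FP16, DTYPE_FP32):
        raise ValueError("this sketch only writes fp16 or fp32")
    num_tokens, token_dim = len(tokens), len(tokens[0])
    fmt = "e" if dtype == DTYPE_FP16 else "f"
    # magic (8) + num_tokens (4) + token_dim (4) + dtype (1) + 15 reserved zeros
    header = MAGIC + struct.pack("<IIB", num_tokens, token_dim, dtype) + bytes(15)
    flat = [v for row in tokens for v in row]
    return header + struct.pack(f"<{len(flat)}{fmt}", *flat)


def unpack_pulidembd(blob: bytes) -> list[list[float]]:
    """Parse a .pulidembd blob into rows of Python floats."""
    if len(blob) < HEADER_SIZE or blob[:8] != MAGIC:
        raise ValueError("not a .pulidembd file")
    num_tokens, token_dim, dtype = struct.unpack_from("<IIB", blob, 8)
    if dtype not in BYTES_PER_VALUE:
        raise ValueError(f"unknown dtype code {dtype}")
    count = num_tokens * token_dim
    if len(blob) != HEADER_SIZE + count * BYTES_PER_VALUE[dtype]:
        raise ValueError("payload size does not match header")
    if dtype == DTYPE_BF16:
        # bf16 is the upper 16 bits of an fp32 value: widen by shifting left.
        raw = struct.unpack_from(f"<{count}H", blob, HEADER_SIZE)
        flat = [struct.unpack("<f", struct.pack("<I", r << 16))[0] for r in raw]
    else:
        fmt = "e" if dtype == DTYPE_FP16 else "f"
        flat = list(struct.unpack_from(f"<{count}{fmt}", blob, HEADER_SIZE))
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(num_tokens)]


# An all-zero (32, 2048) fp16 embedding: 32-byte header + 131,072 payload bytes.
blob = pack_pulidembd([[0.0] * 2048 for _ in range(32)])
print(len(blob))  # → 131104
```

Rejecting a file whose length disagrees with its header is a design choice of this sketch; it catches truncated downloads early, but the actual loader may be more lenient.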

## Command-line usage

The example below uses PowerShell syntax, where the backtick is the
line-continuation character:

```
.\bin\Release\sd-cli.exe `
  --diffusion-model models\flux1-schnell-Q4_K_S.gguf `
  --vae models\ae.safetensors `
  --clip_l models\clip_l.safetensors `
  --t5xxl models\t5xxl_fp16.safetensors `
  --pulid-weights models\pulid_flux_v0.9.1.safetensors `
  --pulid-id-embedding source.pulidembd `
  --pulid-id-weight 1.0 `
  -p "candid photograph of a young woman on a beach at sunset" `
  --cfg-scale 1.0 --sampling-method euler --steps 4 -W 512 -H 512 `
  --seed 42 --clip-on-cpu `
  -o out.png
```

For Flux Dev (instead of Schnell), add `--guidance 3.5` and use `--steps 20`
in place of `--steps 4`.

## Flags

| Flag | Purpose |
|----------------------------|-------------------------------------------------------------------|
| `--pulid-weights <path>` | Path to `pulid_flux_v0.9.x.safetensors`. Loaded with the model. |
| `--pulid-id-embedding <p>` | Path to a `.pulidembd` binary produced by the precompute tool. |
| `--pulid-id-weight <f>` | Identity-injection strength. Typical 0.7-1.2; default 1.0. |

`--pulid-weights` and `--pulid-id-embedding` must both be set to activate
PuLID; `--pulid-id-weight` is optional and defaults to 1.0. Setting only
`--pulid-weights` (no embedding) loads the weights but disables injection
at runtime. Setting `--pulid-id-weight 0` zeros out the contribution
(useful for falsification testing: outputs should be byte-identical to
a no-PuLID run with the same seed).

## Memory budget

At 512x512, 4 steps (Schnell), the 20 cross-attention layers add roughly
10% to denoise time and almost nothing to peak VRAM. Tested on a 12 GB
consumer card alongside Flux Schnell Q4 GGUF + CPU-offloaded clip_l and
t5xxl + GPU-resident VAE.

At 1024x1024 with Flux Dev Q4 + 20 steps + PuLID, the VAE decode compute
buffer doesn't fit on a 12 GB card even with `--vae-on-cpu`. Workaround:
explicitly route VAE to the CPU backend instead of the offload flag:

```
--backend "diffusion=vulkan0,vae=cpu"
```

The `--vae-on-cpu` flag offloads VAE weights but leaves the compute graph
on the default backend; this is existing stable-diffusion.cpp behavior,
not a PuLID-specific issue. Documented here because anyone running PuLID
at 1024 will hit it.

## Backend selection

The standard `--backend` flag works as documented. Common patterns:

```
# AMD Vulkan (device indices depend on enumeration order; vulkan0 here)
--backend "diffusion=vulkan0,vae=cpu"

# NVIDIA Vulkan (vulkan1 here, for the same reason)
--backend "diffusion=vulkan1,vae=cpu"

# CUDA
--backend "diffusion=cuda0,vae=cpu"
```

The PuLID cross-attention layers run on the same backend as the main
diffusion model. They have not yet been independently profiled on every
backend; only Vulkan and CPU have been tested by the original contributor.

## Verification

A three-way SHA-256 check is the recommended sanity test when bringing up
a new combination of model + backend + hardware:

| Run | Expected hash relation |
|----------------------------------------------|------------------------------------|
| A: no `--pulid-*` flags | baseline |
| B: PuLID flags, `--pulid-id-weight 0.0` | **byte-identical to A** |
| C: PuLID flags, `--pulid-id-weight 1.0` | **different from A,B**, preserves source identity |

If A and B differ, the injection is changing the result even at zero
weight, which most likely indicates a bug. If C is identical to A, the
injection is not being applied at all.
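
The hash relations in the table can be checked mechanically. The following is a minimal sketch using only `hashlib`; the function names `digest` and `check_three_way` are hypothetical, and the check covers only the byte-level relations. Whether run C actually preserves the source identity still has to be judged by eye.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of an output image's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def check_three_way(run_a: bytes, run_b: bytes, run_c: bytes) -> list[str]:
    """Return a list of problems; an empty list means the hash check passed.

    run_a: no --pulid-* flags (baseline)
    run_b: PuLID flags with --pulid-id-weight 0.0
    run_c: PuLID flags with --pulid-id-weight 1.0
    """
    a, b, c = digest(run_a), digest(run_b), digest(run_c)
    problems = []
    if a != b:
        problems.append("A and B differ: injection is not a no-op at weight 0")
    if c in (a, b):
        problems.append("C matches A or B: injection had no effect at weight 1")
    return problems


# Usage with real outputs (file names are placeholders):
#   from pathlib import Path
#   runs = [Path(p).read_bytes() for p in ("a.png", "b.png", "c.png")]
#   print(check_three_way(*runs) or "hash relations OK")
```

All three runs must use the same seed, prompt, and sampler settings, otherwise A and B will differ for reasons unrelated to PuLID.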

## Limitations / not yet supported

- **`--skip-layers` (skip-layer-guidance / SLG) combined with PuLID** is not
supported. The `pulid_ca` index advances per non-skipped block, so a
skipped block silently misaligns the cross-attention weight assignment
vs. the trained intervals. The reference PyTorch implementation does
not have SLG either, so there is no well-defined behavior to emulate.
Use either feature alone.
- **PuLID v1.1 weights** (`pulid_v1.1.safetensors`, renamed key layout).
- **Multiple ID images.** The reference PyTorch implementation can fuse
  several portraits into one embedding for stronger identity. This
  implementation accepts exactly one embedding per generation, so any
  multi-image fusion has to happen in the external precompute tool before
  the `.pulidembd` file is written.
- **Negative-prompt branch of CFG.** PuLID only injects on the positive
conditioning path in the published reference, and the implementation
here follows that. Flux's distilled guidance doesn't run a separate
uncond branch in normal use, so this matters only for `--true-cfg`
workflows that aren't standard for Flux.
- **Backends other than Vulkan and CPU** are untested by the original
contributor. The implementation is pure-ggml and should work on CUDA,
ROCm, and Metal, but verification by users on those backends is
welcomed.
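
The `--skip-layers` misalignment described in the first bullet can be illustrated with a small simulation. This is one plausible reading of that bullet, not the PR's actual loop: it assumes a skipped block runs neither its body nor its hook, and that the cross-attention counter advances only when a hook fires. The function name `assigned_ca` is hypothetical.

```python
def assigned_ca(num_blocks: int, interval: int, skipped: set[int]) -> dict[int, int]:
    """Map each hooked block index to the cross-attention module it receives.

    Assumption: a skipped block runs neither its body nor its hook, and the
    module counter only advances when a hook actually fires.
    """
    ca_idx = 0
    assignment = {}
    for i in range(num_blocks):
        if i in skipped:
            continue
        if i % interval == 0:
            assignment[i] = ca_idx
            ca_idx += 1
    return assignment


# Double-stream blocks: 19 blocks, hook interval 2.
trained = assigned_ca(19, 2, skipped=set())
with_slg = assigned_ca(19, 2, skipped={4})

# Blocks whose module differs from the one they were trained with,
# as (trained module, module actually used).
shifted = {
    b: (trained[b], with_slg[b]) for b in with_slg if trained[b] != with_slg[b]
}
print(shifted)
```

Under these assumptions, skipping block 4 makes every later hooked block (6, 8, ..., 18) use the module trained for the previous hook point, with no error raised.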
19 changes: 19 additions & 0 deletions examples/common/common.cpp
@@ -384,6 +384,10 @@ ArgOptions SDContextParams::get_options() {
"--photo-maker",
"path to PHOTOMAKER model",
&photo_maker_path},
{"",
"--pulid-weights",
"path to PuLID flux weights (e.g. pulid_flux_v0.9.1.safetensors). Identity is injected during the denoise loop when paired with --pulid-id-embedding.",
&pulid_weights_path},
{"",
"--upscale-model",
"path to esrgan model.",
@@ -746,6 +750,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
embedding_vec.data(),
static_cast<uint32_t>(embedding_vec.size()),
photo_maker_path.c_str(),
pulid_weights_path.c_str(),
tensor_type_rules.c_str(),
vae_decode_only,
free_params_immediately,
@@ -825,6 +830,10 @@ ArgOptions SDGenerationParams::get_options() {
"--pm-id-embed-path",
"path to PHOTOMAKER v2 id embed",
&pm_id_embed_path},
{"",
"--pulid-id-embedding",
"path to a .pulidembd binary produced by pulid_extract_id.py. Carries a (32, 2048) identity embedding extracted from a source portrait. Pair with --pulid-weights on the context.",
&pulid_id_embedding_path},
{"",
"--hires-upscaler",
"highres fix upscaler, Lanczos, Nearest, Latent, Latent (nearest), Latent (nearest-exact), "
@@ -975,6 +984,10 @@ ArgOptions SDGenerationParams::get_options() {
"--pm-style-strength",
"",
&pm_style_strength},
{"",
"--pulid-id-weight",
"strength of PuLID identity injection (default: 1.0). 0.7-1.2 are typical; lower lets the prompt override the face more, higher tightens identity match.",
&pulid_id_weight},
{"",
"--control-strength",
"strength to apply Control Net (default: 0.9). 1.0 corresponds to full destruction of information in init image",
@@ -2207,6 +2220,11 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
pm_style_strength,
};

sd_pulid_params_t pulid_params = {
pulid_id_embedding_path.empty() ? nullptr : pulid_id_embedding_path.c_str(),
pulid_id_weight,
};

params.loras = lora_vec.empty() ? nullptr : lora_vec.data();
params.lora_count = static_cast<uint32_t>(lora_vec.size());
params.prompt = prompt.c_str();
@@ -2227,6 +2245,7 @@ sd_img_gen_params_t SDGenerationParams::to_sd_img_gen_params_t() {
params.control_image = control_image.get();
params.control_strength = control_strength;
params.pm_params = pm_params;
params.pulid_params = pulid_params;
params.vae_tiling_params = vae_tiling_params;
params.cache = cache_params;

11 changes: 11 additions & 0 deletions examples/common/common.h
@@ -100,6 +100,11 @@ struct SDContextParams {
std::string control_net_path;
std::string embedding_dir;
std::string photo_maker_path;
// PuLID-Flux identity-preservation context path: the safetensors blob
// carrying the PerceiverAttentionCA cross-attention weights. Loaded
// once with the model. Per-generation pulid_id_embedding_path lives in
// SDGenerationParams below.
std::string pulid_weights_path;
sd_type_t wtype = SD_TYPE_COUNT;
std::string tensor_type_rules;
std::string lora_model_dir = ".";
@@ -196,6 +201,12 @@ struct SDGenerationParams {
std::string pm_id_embed_path;
float pm_style_strength = 20.f;

// PuLID-Flux: per-generation identity embedding (binary file produced by
// runtime-scripts/pulid_extract_id.py). Format documented in
// include/stable-diffusion.h sd_pulid_params_t.
std::string pulid_id_embedding_path;
float pulid_id_weight = 1.0f;

int upscale_repeats = 1;
int upscale_tile_size = 128;

34 changes: 34 additions & 0 deletions include/stable-diffusion.h
@@ -186,6 +186,16 @@ typedef struct {
const sd_embedding_t* embeddings;
uint32_t embedding_count;
const char* photo_maker_path;
/**
* Path to pulid_flux_v0.9.1.safetensors (the PuLID identity-injection
* cross-attention weights). When set together with sd_img_gen_params_t.
* pulid_params.id_embedding_path, the Flux diffusion model performs PuLID
* cross-attention injection during the denoise loop. Loaded once with
* the model; the embedding is per-generation. Currently only meaningful
* for Flux (depth=19 double, 38 single blocks); silently ignored for
* other model versions.
*/
const char* pulid_weights_path;
const char* tensor_type_rules;
bool vae_decode_only;
bool free_params_immediately;
@@ -266,6 +276,29 @@ typedef struct {
float style_strength;
} sd_pm_params_t; // photo maker

/**
* PuLID-Flux identity preservation params.
*
* Unlike PhotoMaker (which extracts the ID embedding inside the inference
* process from a directory of images), PuLID's ID extraction is a heavy
* Python-only stack (insightface ArcFace + EVA-CLIP-L + IDFormer). To stay
* cross-vendor in C++/Vulkan, sd.cpp consumes a precomputed binary file
* produced by an external tool (runtime-scripts/pulid_extract_id.py in the
* Cloudhands client tree).
*
* Binary format (.pulidembd):
* offset 0 : magic "PULIDV01" (8 bytes ASCII)
* offset 8 : num_tokens (uint32 LE)
* offset 12 : token_dim (uint32 LE)
* offset 16 : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
* offset 17 : reserved zeros (15 bytes; header = 32 bytes total)
* offset 32 : tokens, row-major LE (num_tokens * token_dim values)
*/
typedef struct {
const char* id_embedding_path; // path to .pulidembd file produced by pulid_extract_id.py
float id_weight; // strength of the ID injection; typical 0.7-1.2, default 1.0
} sd_pulid_params_t;

enum sd_cache_mode_t {
SD_CACHE_DISABLED = 0,
SD_CACHE_EASYCACHE,
@@ -358,6 +391,7 @@ typedef struct {
sd_image_t control_image;
float control_strength;
sd_pm_params_t pm_params;
sd_pulid_params_t pulid_params;
sd_tiling_params_t vae_tiling_params;
sd_cache_params_t cache;
sd_hires_params_t hires;