feat: PuLID-Flux identity-injection support #1542

Open
RapidMark wants to merge 1 commit into leejet:master from CloudhandsAI:cloudhands/pulid-flux

This PR adds support for [PuLID-Flux](https://github.com/ToTheBeginning/PuLID)
identity preservation to the Flux denoise loop. Given a single source
portrait, generated images preserve the source person's face across
arbitrary scenes and prompts.

### What's included

- `src/pulid.hpp` — `PuLIDPerceiverAttentionCA`, the cross-attention
  module mirroring the PyTorch reference at
  [ToTheBeginning/PuLID/.../encoders_transformer.py](https://github.com/ToTheBeginning/PuLID/blob/main/pulid/encoders_transformer.py).
  Pure-ggml graph; runs on CPU / CUDA / Vulkan / Metal without
  backend-specific code.
- `src/flux.hpp` — adds 20 `pulid_ca.<i>` child blocks to `Flux`
  (constructed conditionally when `params.pulid_enabled` is set),
  inserts the cross-attention call between transformer blocks at the
  intervals the PyTorch reference uses (every 2nd double block, every
  4th single block), and threads two new optional parameters
  (`pulid_id`, `pulid_id_weight`) through `forward`, `forward_orig`,
  `forward_chroma_radiance`, `forward_flux_chroma`, `compute`, and
  `build_graph`.
- `src/stable-diffusion.cpp` — loads `pulid_*.safetensors` via
  `model_loader.init_from_file` under the existing
  `model.diffusion_model.` prefix so PuLID-CA tensors bind to the new
  blocks naturally. PuLID-encoder keys, which are consumed only by the
  precompute tool and have no C++ counterpart, are reported as unknown;
  this is expected. Adds `load_pulid_id_embedding()`, which parses a
  small `.pulidembd` binary file and wraps its contents as a
  `sd::Tensor<float>` passed via `DiffusionParams`.
- `include/stable-diffusion.h` — public API: `sd_pulid_params_t`
  (per-generation embedding path + weight), `pulid_weights_path` on
  `sd_ctx_params_t`, `pulid_params` on `sd_img_gen_params_t`.
- `examples/common/common.{cpp,h}` — three new CLI flags:
  `--pulid-weights <path>`, `--pulid-id-embedding <path>`, and
  `--pulid-id-weight <float>`.
- `src/diffusion_model.hpp` — extends `DiffusionParams` to carry the
  new identity embedding + weight; `FluxModel::compute` forwards both
  through.
- `docs/pulid.md` — usage, binary format spec, supported PuLID weight
  versions (v0.9.0 / v0.9.1; v1.1 deferred), memory budget notes, and
  a three-way SHA-256 falsification recipe.
- `scripts/pulid_extract_id.py` — reference precompute tool that
  produces the `.pulidembd` binary from a source portrait. Lives
  outside the C++ build because identity extraction (insightface +
  EVA-CLIP-L + IDFormer) is a heavy PyTorch stack that would be
  impractical to port to ggml just to run once per source person.
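
The injection intervals described for `src/flux.hpp` can be checked with a little arithmetic. The sketch below is illustrative only: it assumes the usual Flux.1 depths of 19 double blocks and 38 single blocks (not stated in this PR), and it assumes injection happens after every block whose index is divisible by the interval, which is how the PyTorch reference appears to do it. `injection_schedule` is a hypothetical helper, not a function from this PR.

```python
# Assumed Flux.1 depths; these numbers are not taken from this PR.
NUM_DOUBLE_BLOCKS = 19
NUM_SINGLE_BLOCKS = 38
DOUBLE_INTERVAL = 2  # every 2nd double block
SINGLE_INTERVAL = 4  # every 4th single block


def injection_schedule(num_double=NUM_DOUBLE_BLOCKS, num_single=NUM_SINGLE_BLOCKS):
    """Return (stage, block_index, ca_index) for each PuLID injection point."""
    schedule = []
    ca_index = 0  # index into the pulid_ca.<i> child blocks
    for stage, count, interval in (
        ("double", num_double, DOUBLE_INTERVAL),
        ("single", num_single, SINGLE_INTERVAL),
    ):
        for block_index in range(count):
            # Assumption: inject after blocks 0, interval, 2*interval, ...
            if block_index % interval == 0:
                schedule.append((stage, block_index, ca_index))
                ca_index += 1
    return schedule


schedule = injection_schedule()
print(len(schedule), "injection points")
```

Under these assumptions the schedule has 10 injection points in the double blocks and 10 in the single blocks, which is consistent with the 20 `pulid_ca.<i>` child blocks.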

### Why split extraction from injection

PuLID-Flux's identity extractor is a stack of three large PyTorch
components (insightface face analysis with ArcFace embeddings, the
EVA-CLIP-L vision encoder, and the IDFormer perceiver-resampler).
Porting all three to C++/ggml would add ~5000 lines for code that runs
once per source person and produces a 131 KB output. Having sd.cpp
consume a precomputed binary file instead keeps the C++ surface area
small (~600 lines), lets the heavy ML stack run once per person on any
backend that supports PyTorch, and decouples the PuLID integration from
ongoing upstream development of insightface, EVA-CLIP, and IDFormer.

### Binary format

```
offset 0   : magic "PULIDV01"      (8 bytes ASCII)
offset 8   : num_tokens (uint32 LE)
offset 12  : token_dim (uint32 LE)
offset 16  : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
offset 17  : reserved zeros        (15 bytes; header total = 32)
offset 32  : tokens, row-major LE
```

A typical embedding (32 tokens, 2048 dims, fp16) is
32 + 32 * 2048 * 2 = 131,104 bytes, about 131 KB.
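
As a rough illustration of the layout above, the following sketch packs and parses the 32-byte header with Python's `struct` module. `pack_header` and `parse_header` are hypothetical names, not functions from `scripts/pulid_extract_id.py` or the C++ loader, and the strict length check is an assumption about how a careful reader might validate the file.

```python
import struct

MAGIC = b"PULIDV01"
HEADER_SIZE = 32
# dtype code -> bytes per element (0=fp16, 1=bf16, 2=fp32)
DTYPE_BYTES = {0: 2, 1: 2, 2: 4}
# "<" little-endian, 8s magic, two uint32, one uint8, 15 zero pad bytes
HEADER_FORMAT = "<8sIIB15x"


def pack_header(num_tokens, token_dim, dtype):
    """Build the 32-byte header described in the format table."""
    return struct.pack(HEADER_FORMAT, MAGIC, num_tokens, token_dim, dtype)


def parse_header(blob):
    """Validate a .pulidembd blob; return (num_tokens, token_dim, dtype, payload_bytes)."""
    if len(blob) < HEADER_SIZE:
        raise ValueError("file is shorter than the 32-byte header")
    magic, num_tokens, token_dim, dtype = struct.unpack_from(HEADER_FORMAT, blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic: %r" % magic)
    if dtype not in DTYPE_BYTES:
        raise ValueError("unknown dtype code: %d" % dtype)
    payload = num_tokens * token_dim * DTYPE_BYTES[dtype]
    # Assumption: the file holds exactly header + tokens, with no trailer.
    if len(blob) != HEADER_SIZE + payload:
        raise ValueError("size mismatch: expected %d bytes" % (HEADER_SIZE + payload))
    return num_tokens, token_dim, dtype, payload


# A zero-filled embedding with the typical shape (32, 2048, fp16).
blob = pack_header(32, 2048, 0) + bytes(32 * 2048 * 2)
print(len(blob), parse_header(blob))
```

The zero-filled payload is only there to exercise the size check; real token data would be written row-major in little-endian order, as the format specifies.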

### Verification

The three-way SHA-256 falsification recipe in `docs/pulid.md`
distinguishes "the feature is wired up but has no effect" from
"the feature is actively altering the diffusion trajectory":

| Run                                     | Expected hash relation                    |
|-----------------------------------------|--------------------------------------------|
| A: no `--pulid-*` flags                 | baseline                                   |
| B: PuLID flags, `--pulid-id-weight 0.0` | byte-identical to A                        |
| C: PuLID flags, `--pulid-id-weight 1.0` | differs, preserves source identity         |
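
The hash relations in the table can be checked mechanically. This is a minimal sketch using Python's `hashlib`; `sha256_of` and `three_way_check` are hypothetical helper names, and the output file paths are whatever the three runs produced. It covers only the byte-level relations: whether run C preserves the source identity still has to be judged visually.

```python
import hashlib


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def three_way_check(path_a, path_b, path_c):
    """Compare the outputs of runs A (baseline), B (weight 0.0), C (weight 1.0)."""
    hash_a = sha256_of(path_a)
    hash_b = sha256_of(path_b)
    hash_c = sha256_of(path_c)
    return {
        # B must be byte-identical to A: weight 0.0 changes nothing.
        "weight_zero_is_noop": hash_a == hash_b,
        # C must differ from A: weight 1.0 alters the trajectory.
        "weight_one_changes_output": hash_a != hash_c,
    }
```

Both entries should be `True` for a passing run, for example after calling `three_way_check("a.png", "b.png", "c.png")` on the three generated images (file names here are placeholders).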

Verified on three backend/GPU combinations built from the same source:

- **Vulkan-AMD** (RX 6700 XT, `-DSD_VULKAN=ON`): A == B byte-identical,
  A != C, C visually preserves source identity.
- **Vulkan-NVIDIA** (RTX 3060, same binary, `--backend "diffusion=vulkan1"`):
  A == B, A != C, C visually equivalent to the AMD output at the same
  seed (different bytes per the usual cross-backend nondeterminism).
- **CUDA-NVIDIA** (RTX 3060, separate `-DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86`
  build against CUDA 13.2): A == B byte-identical, A != C, C visually
  preserves source identity.

The pure-ggml graph code in `PuLIDPerceiverAttentionCA` runs unchanged
in all three configurations; no backend-specific conditionals were
needed.

Per-image sampling times at 512x512 / 4 steps / Flux Schnell Q4 + PuLID:

| Backend                | Sampling (s) | Notes                          |
|------------------------|-------------:|--------------------------------|
| AMD 6700 XT (Vulkan)   | 22           | 12 GB consumer card            |
| NVIDIA 3060 (Vulkan)   | 11           | same binary as AMD             |
| NVIDIA 3060 (CUDA)     | 9.6          | separate `-DSD_CUDA=ON` build  |

`batch_count=3` was tested separately and supports the case for a
long-lived worker: per-image sampling time drops from 19.6 s (cold) to
~11 s (warm) because the model stays resident across batch iterations.

Tested with Flux Schnell Q4_K_S + PuLID v0.9.1 at 512x512 / 4 steps,
and Flux Dev Q4_K_S + PuLID v0.9.1 at 768x768 / 20 steps. Flux Dev +
PuLID at 1024x1024 runs out of memory on a 12 GB card unless the VAE is
routed to the CPU backend with `--backend "vae=cpu"`. `--vae-on-cpu`
alone is not enough, because it offloads only the weights and not the
compute buffer. This is existing stable-diffusion.cpp behavior rather
than a PuLID-specific issue, but it is documented in `docs/pulid.md`
because PuLID users are likely to hit it.

Also tested with `batch_count` > 1: each image in the batch shows the
same identity with a different composition.

### Not yet supported (called out in docs/pulid.md)

- PuLID v1.1 (`pulid_v1.1.safetensors`): uses a renamed key layout
  (`id_adapter_attn_layers.*` instead of `pulid_ca.*`) and possibly a
  different module structure. Deferred to a follow-up PR.
- Multiple ID images fused into one embedding: the reference Python
  pipeline supports this, but the current precompute tool accepts only
  one portrait per run.
- Injection on the `--true-cfg` negative-prompt branch: the reference
  implementation injects PuLID only on the positive conditioning path,
  and this PR matches that behavior.

### Backward compatibility

Non-PuLID generations are unaffected. The `params.pulid_enabled` flag
defaults to false and is set only when the model loader finds a
`pulid_ca.*` tensor in the loaded safetensors file. A regression run
of Flux Schnell Q4 without `--pulid-*` flags produces output that is
byte-identical to the pre-patch build.
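
The detection rule above amounts to a prefix test on tensor names. The sketch below is illustrative only: the tensor names are made-up examples rather than keys read from a real PuLID checkpoint, and `split_pulid_keys` and `pulid_enabled` are hypothetical helpers, not the C++ loader's actual logic.

```python
# Prefix the PR loads PuLID tensors under.
PREFIX = "model.diffusion_model."


def split_pulid_keys(raw_names, prefix=PREFIX):
    """Partition tensor names into PuLID-CA keys and everything else."""
    ca_keys, other_keys = [], []
    for name in raw_names:
        full_name = prefix + name
        if full_name.startswith(prefix + "pulid_ca."):
            ca_keys.append(full_name)   # would bind to the new Flux child blocks
        else:
            other_keys.append(full_name)  # e.g. encoder keys, reported as unknown
    return ca_keys, other_keys


def pulid_enabled(raw_names, prefix=PREFIX):
    """True when at least one pulid_ca.* tensor is present."""
    ca_keys, _ = split_pulid_keys(raw_names, prefix)
    return len(ca_keys) > 0


# Hypothetical key names, for illustration only.
example_names = [
    "pulid_ca.0.norm1.weight",
    "pulid_encoder.latents",
    "pulid_ca.19.to_out.weight",
]
ca_keys, other_keys = split_pulid_keys(example_names)
print(len(ca_keys), "CA keys;", len(other_keys), "other keys")
```

A file with no `pulid_ca.*` names leaves the flag false, which is the condition the regression run relies on.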

### File summary

```
include/stable-diffusion.h          +34 / -0
src/stable-diffusion.cpp           +120 / -0
src/diffusion_model.hpp              +5 / -1
src/flux.hpp                       +106 / -10
src/pulid.hpp                      +127 / -0   (new)
examples/common/common.h             +6 / -0
examples/common/common.cpp          +19 / -0
docs/pulid.md                      +220 / -0   (new)
scripts/pulid_extract_id.py        +135 / -0   (new)
```

Total ~770 added lines, ~10 changed. No removed functionality.

RapidMark force-pushed the `cloudhands/pulid-flux` branch from 616d8d0 to aef4d29 (May 22, 2026 01:39)