feat: PuLID-Flux identity-injection support #1542
Open
RapidMark wants to merge 1 commit into
This PR adds support for [PuLID-Flux](https://github.com/ToTheBeginning/PuLID) identity preservation to the Flux denoise loop. Given a single source portrait, generated images preserve the source person's face across arbitrary scenes and prompts.

### What's included

- `src/pulid.hpp` -- `PuLIDPerceiverAttentionCA`, the cross-attention module mirroring the PyTorch reference at [ToTheBeginning/PuLID/.../encoders_transformer.py](https://github.com/ToTheBeginning/PuLID/blob/main/pulid/encoders_transformer.py). Pure-ggml graph; runs on CPU / CUDA / Vulkan / Metal without backend-specific code.
- `src/flux.hpp` -- adds 20 `pulid_ca.<i>` child blocks to `Flux` (constructed conditionally when `params.pulid_enabled` is set), inserts the cross-attention call between transformer blocks at the intervals the PyTorch reference uses (every 2nd double block, every 4th single block), and threads two new optional parameters (`pulid_id`, `pulid_id_weight`) through `forward`, `forward_orig`, `forward_chroma_radiance`, `forward_flux_chroma`, `compute`, and `build_graph`.
- `src/stable-diffusion.cpp` -- loads `pulid_*.safetensors` via `model_loader.init_from_file` under the existing `model.diffusion_model.` prefix so PuLID-CA tensors bind to the new blocks naturally. PuLID-encoder keys (which live in the precompute tool, not in C++) are correctly identified as unknown. Adds `load_pulid_id_embedding()` to parse a small `.pulidembd` binary file and wraps its content as a `sd::Tensor<float>` passed via `DiffusionParams`.
- `include/stable-diffusion.h` -- public API: `sd_pulid_params_t` (per-generation embedding path + weight), `pulid_weights_path` on `sd_ctx_params_t`, `pulid_params` on `sd_img_gen_params_t`.
- `examples/common/common.{cpp,h}` -- three new CLI flags: `--pulid-weights <path>`, `--pulid-id-embedding <path>`, and `--pulid-id-weight <float>`.
- `src/diffusion_model.hpp` -- extends `DiffusionParams` to carry the new identity embedding + weight; `FluxModel::compute` forwards both through.
- `docs/pulid.md` -- usage, binary format spec, supported PuLID weight versions (v0.9.0 / v0.9.1; v1.1 deferred), memory budget notes, and a three-way SHA-256 falsification recipe.
- `scripts/pulid_extract_id.py` -- reference precompute tool that produces the `.pulidembd` binary from a source portrait. Lives outside the C++ build because identity extraction (insightface + EVA-CLIP-L + IDFormer) is a heavy PyTorch stack that would be impractical to port to ggml just to run once per source person.

### Why split extraction from injection

PuLID-Flux's identity extractor is a stack of three large PyTorch models (ArcFace face detector + EVA-CLIP-L vision encoder + IDFormer perceiver-resampler). Porting all three to C++/ggml would add ~5000 lines for code that runs once per source person and produces a 131 KB output. By making sd.cpp consume a precomputed binary file, the C++ surface area is small (~600 lines), the heavy ML stack only needs to run once per person on any backend that supports PyTorch, and adding PuLID is decoupled from the active development on insightface / EVA-CLIP / IDFormer.

### Binary format

```
offset  0 : magic "PULIDV01" (8 bytes ASCII)
offset  8 : num_tokens (uint32 LE)
offset 12 : token_dim (uint32 LE)
offset 16 : dtype (uint8): 0=fp16, 1=bf16, 2=fp32
offset 17 : reserved zeros (15 bytes; header total = 32)
offset 32 : tokens, row-major LE
```

Typical (32, 2048, fp16) = 131 KB.
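
To make the header layout concrete, here is a minimal Python sketch of packing and parsing the 32-byte header described above. It is not the code from `scripts/pulid_extract_id.py` or `load_pulid_id_embedding()`; the function names `pack_header` and `parse_header` are hypothetical, and only the field offsets, the magic string, and the dtype codes come from the spec.

```python
import struct

MAGIC = b"PULIDV01"
HEADER_SIZE = 32
# dtype code -> bytes per element, following the spec (0=fp16, 1=bf16, 2=fp32)
DTYPE_BYTES = {0: 2, 1: 2, 2: 4}


def pack_header(num_tokens, token_dim, dtype):
    """Build the 32-byte little-endian header (hypothetical helper)."""
    # "<" = little-endian, no padding; 8s + I + I + B + 15x = 8+4+4+1+15 = 32 bytes.
    # The 15 pad bytes ("x") are written as zeros, matching "reserved zeros".
    return struct.pack("<8sIIB15x", MAGIC, num_tokens, token_dim, dtype)


def parse_header(buf):
    """Return (num_tokens, token_dim, dtype, payload_bytes) or raise ValueError."""
    if len(buf) < HEADER_SIZE:
        raise ValueError("buffer is shorter than the 32-byte header")
    magic, num_tokens, token_dim, dtype = struct.unpack_from("<8sIIB", buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic: %r" % magic)
    if dtype not in DTYPE_BYTES:
        raise ValueError("unknown dtype code: %d" % dtype)
    payload_bytes = num_tokens * token_dim * DTYPE_BYTES[dtype]
    return num_tokens, token_dim, dtype, payload_bytes


header = pack_header(32, 2048, 0)
num_tokens, token_dim, dtype, payload_bytes = parse_header(header)
total_bytes = HEADER_SIZE + payload_bytes
```

For the typical (32, 2048, fp16) case this gives a 131,072-byte payload plus the 32-byte header, which is where the 131 KB figure comes from. A real loader would presumably also check that the file length equals `total_bytes` before reading the tokens.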
### Verification

The three-way SHA-256 falsification recipe in docs/pulid.md distinguishes "the feature is wired but doesn't do anything" from "the feature is actively altering the diffusion trajectory":

| Run                                     | Expected hash relation                     |
|-----------------------------------------|--------------------------------------------|
| A: no `--pulid-*` flags                 | baseline                                   |
| B: PuLID flags, `--pulid-id-weight 0.0` | byte-identical to A                        |
| C: PuLID flags, `--pulid-id-weight 1.0` | differs, preserves source identity         |

Verified on three backends with the same source code:

- **Vulkan-AMD** (RX 6700 XT, `-DSD_VULKAN=ON`): A == B byte-identical, A != C, C visually preserves source identity.
- **Vulkan-NVIDIA** (RTX 3060, same binary, `--backend "diffusion=vulkan1"`): A == B, A != C, C visually equivalent to the AMD output at the same seed (different bytes per the usual cross-backend nondeterminism).
- **CUDA-NVIDIA** (RTX 3060, separate `-DSD_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86` build against CUDA 13.2): A == B byte-identical, A != C, C visually preserves source identity.

PerceiverAttentionCA's pure-ggml graph code runs unchanged across all three backends -- no backend-specific conditionals were needed.

Per-image sampling times at 512x512 / 4 steps / Flux Schnell Q4 + PuLID:

| Backend                | Sampling (s) | Notes                          |
|------------------------|-------------:|--------------------------------|
| AMD 6700 XT (Vulkan)   |           22 | 12 GB consumer card            |
| NVIDIA 3060 (Vulkan)   |           11 | same binary as AMD             |
| NVIDIA 3060 (CUDA)     |          9.6 | separate `-DSD_CUDA=ON` build  |

batch_count=3 was tested separately and confirms the long-lived-worker amortization story: per-image sampling drops from 19.6 s (cold) to ~11 s (warm) as the model stays resident across batch iterations.

Tested with Flux Schnell Q4_K_S + PuLID v0.9.1 at 512x512 / 4 steps, and Flux Dev Q4_K_S + PuLID v0.9.1 at 768x768 / 20 steps.
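
The hash comparison in the three-way recipe can be scripted. The sketch below is an illustration, not the recipe text from docs/pulid.md: the helper names `sha256_of` and `classify` are hypothetical, and it only covers the hash relations in the table. Whether run C preserves the source identity still has to be checked by eye.

```python
import hashlib


def sha256_of(path):
    """Hex SHA-256 of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def classify(hash_a, hash_b, hash_c):
    """Interpret the hashes of runs A (baseline), B (weight 0.0), C (weight 1.0)."""
    if hash_a != hash_b:
        # Weight 0.0 should be a no-op; a mismatch means the PuLID path
        # perturbs the output even when it is scaled to zero.
        return "regression: weight 0.0 changed the output"
    if hash_a == hash_c:
        # Flags were accepted but the embedding never reached the denoise loop.
        return "inert: feature is wired but does nothing"
    return "active: feature alters the diffusion trajectory"
```

Usage would look like `classify(sha256_of("a.png"), sha256_of("b.png"), sha256_of("c.png"))`, where the three file names stand for the outputs of runs A, B, and C at the same seed on the same backend.
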
1024x1024 + Dev + PuLID OOMs on a 12 GB card unless the VAE is routed to the CPU backend via `--backend "vae=cpu"` (not just `--vae-on-cpu`, which only offloads weights, not the compute buffer); this is existing stable-diffusion.cpp behavior, not a PuLID-specific issue, but documented in docs/pulid.md because PuLID users will hit it.

Tested with batch_count > 1 (verified each image gets the same identity, different composition).

### Not yet supported (called out in docs/pulid.md)

- PuLID v1.1 (`pulid_v1.1.safetensors`) -- has renamed key layout (`id_adapter_attn_layers.*` vs `pulid_ca.*`) and potentially different module structure. Follow-up PR.
- Multiple ID images fused into one embedding (the reference Python pipeline supports this; the current precompute tool accepts only one portrait per run).
- The `--true-cfg` negative-prompt branch -- PuLID only injects on the positive conditioning path in the reference implementation; this matches.

### Backward compatibility

Non-PuLID generations are unaffected. The `params.pulid_enabled` flag defaults to false and is only set when the model loader sees a `pulid_ca.*` tensor in the loaded safetensors file. A regression run of Flux Schnell Q4 without `--pulid-*` flags produces byte-identical output to pre-patch.

### File summary

```
include/stable-diffusion.h      +34  / -0
src/stable-diffusion.cpp        +120 / -0
src/diffusion_model.hpp         +5   / -1
src/flux.hpp                    +106 / -10
src/pulid.hpp                   +127 / -0   (new)
examples/common/common.h        +6   / -0
examples/common/common.cpp      +19  / -0
docs/pulid.md                   +220 / -0   (new)
scripts/pulid_extract_id.py     +135 / -0   (new)
```

Total ~770 added lines, ~10 changed. No removed functionality.
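
As a cross-check on the 20 `pulid_ca.<i>` blocks mentioned in the `src/flux.hpp` entry under What's included, the injection schedule can be worked out in a few lines of Python. This sketch rests on two assumptions not stated in this PR: that Flux has 19 double blocks and 38 single blocks, and that injection happens at block indices where `i % interval == 0` (starting at block 0). The helper name `injection_points` is hypothetical.

```python
def injection_points(num_blocks, interval):
    """Block indices after which a PuLID cross-attention call would be inserted."""
    return [i for i in range(num_blocks) if i % interval == 0]


# Assumed Flux depths; not stated in this PR.
NUM_DOUBLE_BLOCKS = 19
NUM_SINGLE_BLOCKS = 38

# Intervals from the PR text: every 2nd double block, every 4th single block.
double_points = injection_points(NUM_DOUBLE_BLOCKS, 2)
single_points = injection_points(NUM_SINGLE_BLOCKS, 4)

# Each injection point consumes one pulid_ca.<i> block, in order.
total_ca_blocks = len(double_points) + len(single_points)
```

Under those assumptions the double blocks contribute indices 0, 2, ..., 18 and the single blocks 0, 4, ..., 36, ten each, which matches the 20 child blocks added to `Flux`.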
616d8d0 to aef4d29