NV embedding layer loading#197

Open
z52527 wants to merge 2 commits into triton-inference-server:main from z52527:nve-layer-loading
@z52527 z52527 commented Jun 10, 2026

Problem

Some PT2/AOTI models need a one-time native initialization at model load — before the package is loaded — that cannot live inside model.pt2. A concrete example: a model whose custom ops resolve embedding tables from a process-global registry at execute time, where the weights are stored in sidecar files next to the package (not inside model.pt2). Loading the .pt2 does not load them, so the first inference request fails:

... no binding registered for layer_id=1
-> Inductor model instance execution failure ... model_container_runner.cpp:150

The Triton PyTorch backend currently has no entry point to run such load-time initialization, so today it can only be done with fragile workarounds (e.g. an LD_PRELOAD shim that hijacks an unrelated CUDA call to piggyback on its timing).

What this PR does

Adds a generic, optional model-init hook to the PT2 path. When a model's config.pbtxt sets the MODEL_INIT_LIBRARY parameter, the backend dlopen()s that shared library in LoadModel() — right before the AOTIModelPackageLoader is created (CUDA is already initialized there) — and calls a fixed C entry point. Design choices:

  • Backend stays generic — it does not link against the library and knows nothing about what it does. It only dlopens the .so, calls triton_pytorch_model_init(model_dir, device_index), and releases it (triton_pytorch_model_release + dlclose) on unload. All model-specific logic lives in the user-provided plug-in, built and shipped downstream.
  • No new dependency — the only added link is ${CMAKE_DL_LIBS} (CMake's built-in libdl variable, for dlopen/dlsym/dlclose). The backend links no vendor libraries, so the stock nvcr.io/nvidia/pytorch:26.05-py3 build is unaffected.
  • No-op by default — a model without the parameter is completely unaffected; a dlopen/dlsym failure fails model load (fail-closed).
  • Scoped to the model lifetime — the handle is held in ModelState and freed in the destructor.
  • Fixes a pre-existing parameters bug: ParseParameters() was guarded by if (!ModelConfig().Find("parameters", ...)), so PT2 parameters were parsed only when the section was absent; any model that set a parameters block had them all (INFERENCE_MODE, …) silently ignored. Dropping the ! makes them take effect.
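
The load and unload flow in the bullets above can be sketched as a small RAII holder. This is an illustration only, not the code in this PR: the class name ModelInitHook, the RTLD_NOW | RTLD_LOCAL flags, and the exception-based error reporting are assumptions. The entry-point names are the ones given in this description.

```cpp
#include <dlfcn.h>

#include <stdexcept>
#include <string>

// Signatures of the two entry points named in this description.
using InitFn = void* (*)(const char* model_dir, int device_index);
using ReleaseFn = void (*)(void* state);

// Hypothetical holder; the PR keeps the equivalent handles in ModelState.
class ModelInitHook {
 public:
  ModelInitHook(
      const std::string& library_path, const std::string& model_dir,
      int device_index)
  {
    // Assumed flags: resolve all symbols now, keep them private to this handle.
    lib_ = dlopen(library_path.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (lib_ == nullptr) {
      const char* err = dlerror();
      throw std::runtime_error(
          std::string("dlopen failed: ") + (err ? err : "unknown error"));
    }
    auto init =
        reinterpret_cast<InitFn>(dlsym(lib_, "triton_pytorch_model_init"));
    if (init == nullptr) {
      dlclose(lib_);  // fail closed: a missing init symbol aborts the load
      throw std::runtime_error("missing symbol triton_pytorch_model_init");
    }
    // The release entry point is optional, so a null result is tolerated.
    release_ = reinterpret_cast<ReleaseFn>(
        dlsym(lib_, "triton_pytorch_model_release"));
    state_ = init(model_dir.c_str(), device_index);
  }

  ~ModelInitHook()
  {
    if (release_ != nullptr) {
      release_(state_);
    }
    dlclose(lib_);
  }

  ModelInitHook(const ModelInitHook&) = delete;
  ModelInitHook& operator=(const ModelInitHook&) = delete;

 private:
  void* lib_ = nullptr;
  void* state_ = nullptr;
  ReleaseFn release_ = nullptr;
};
```

Tying release and dlclose to a destructor mirrors the "scoped to the model lifetime" bullet: the library stays mapped for as long as the owning object exists.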

The library must export:

// Called once at model load, before the package is loaded.
void* triton_pytorch_model_init(const char* model_dir, int device_index);
// Optional; called once on unload with the handle returned above.
void  triton_pytorch_model_release(void* state);
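
For orientation, a plug-in that satisfies this ABI might look like the sketch below. Everything apart from the two exported names and their signatures is hypothetical: the InitState struct, the choice to return nullptr on failure, and the placeholder library path in the comment. A real plug-in would do its model-specific loading where the comment indicates.

```cpp
#include <string>

// config.pbtxt side (standard Triton parameter syntax; the path is a
// placeholder, not a value taken from this PR):
//   parameters: {
//     key: "MODEL_INIT_LIBRARY"
//     value: { string_value: "/path/to/libmodel_init.so" }
//   }

namespace {
// Hypothetical per-model state owned by the plug-in.
struct InitState {
  std::string model_dir;
  int device_index;
};
}  // namespace

extern "C" {

// Called once at model load, before the package is loaded.
void* triton_pytorch_model_init(const char* model_dir, int device_index)
{
  if (model_dir == nullptr) {
    return nullptr;  // this sketch's own convention for "failed"
  }
  try {
    // A real plug-in would read its sidecar files under model_dir and
    // register them here; this sketch only records the arguments.
    return new InitState{model_dir, device_index};
  }
  catch (...) {
    // Never let a C++ exception cross the C ABI boundary.
    return nullptr;
  }
}

// Optional; called once on unload with the handle returned above.
void triton_pytorch_model_release(void* state)
{
  delete static_cast<InitState*>(state);  // deleting nullptr is a no-op
}

}  // extern "C"
```

The extern "C" block keeps the symbol names unmangled, which is what lets a dlsym lookup by plain name succeed.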

Verified end-to-end on the official nvcr.io/nvidia/tritonserver:26.05-py3 image with an AOTI model (parameter parsed → hook loaded → model READY → inference correct); also compiles against main.

Some PT2/AOTI models need one-time native initialization at model load -- before
the package is loaded -- that cannot live inside model.pt2. For example, a model
whose custom ops resolve weights from a process-global registry at execute time,
where those weights are stored outside the package and must be read in and
registered first.

Add an optional, per-model hook. When a model's config.pbtxt sets the
MODEL_INIT_LIBRARY parameter, the backend dlopen()s that shared library at model
load and calls its triton_pytorch_model_init(model_dir, device_index) entry
point, holds the returned handle in ModelState, and releases it
(triton_pytorch_model_fini + dlclose) on unload. The backend does not link
against the library and knows nothing about what it does -- it only loads it,
calls a fixed C entry point, and releases it on unload.

Unset by default (a complete no-op). dlopen/dlsym failure fails model load.

This also fixes a pre-existing bug in ParseParameters(): the block was guarded
by `if (!ModelConfig().Find("parameters", ...))`, i.e. parameters were parsed
only when the block was ABSENT. Any PT2 model that actually provides a
`parameters` section had every parameter (INFERENCE_MODE, CACHE_CLEANING_ENABLED,
... and MODEL_INIT_LIBRARY) silently ignored. Dropping the `!` makes parameters
take effect when present.
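
The effect of the inverted guard can be reproduced in isolation. The sketch below uses a hypothetical FakeConfig stand-in rather than the backend's real config type, and the "true" value for INFERENCE_MODE is a placeholder; only the shape of the if condition follows the description above.

```cpp
#include <map>
#include <optional>
#include <string>

using Params = std::map<std::string, std::string>;

// Hypothetical stand-in for the model config: Find() returns true and
// fills *out only when the named section exists.
struct FakeConfig {
  std::map<std::string, Params> sections;

  bool Find(const std::string& name, Params* out) const
  {
    auto it = sections.find(name);
    if (it == sections.end()) {
      return false;
    }
    *out = it->second;
    return true;
  }
};

// Buggy shape: the body runs only when "parameters" is absent, so params
// is always empty inside it and no key is ever found.
std::optional<std::string> ReadParamBuggy(
    const FakeConfig& config, const std::string& key)
{
  Params params;
  if (!config.Find("parameters", &params)) {
    auto it = params.find(key);
    if (it != params.end()) {
      return it->second;
    }
  }
  return std::nullopt;
}

// Fixed shape: dropping the ! parses the section when it is present.
std::optional<std::string> ReadParamFixed(
    const FakeConfig& config, const std::string& key)
{
  Params params;
  if (config.Find("parameters", &params)) {
    auto it = params.find(key);
    if (it != params.end()) {
      return it->second;
    }
  }
  return std::nullopt;
}
```

With a config that sets INFERENCE_MODE, ReadParamBuggy returns an empty optional while ReadParamFixed returns the configured value.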

- src/pt2/model_state.{cc,hh}: fix the inverted Find check; parse
  MODEL_INIT_LIBRARY (string_value via MemberAsString); run the hook in
  LoadModel; free in the destructor
- CMakeLists.txt: link ${CMAKE_DL_LIBS} for dlopen/dlsym/dlclose
- README.md: document the parameter, the C ABI, and the trust note

Verified end-to-end on nvcr.io/nvidia/tritonserver:26.05-py3 with an HSTU
GR-ranking AOTI model: parameters parsed, hook dlopen'd, 2 NVE layers loaded into
the process-global registry, model READY, inference returns correct logits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Runchu Zhao <zhaorunchu@gmail.com>
@z52527 z52527 force-pushed the nve-layer-loading branch from 52831c9 to 13b1e93 Compare June 11, 2026 10:16
whoisj previously approved these changes Jun 11, 2026

@whoisj whoisj left a comment

Contributor

Overall, this is good. Thank you for the contribution. I've left a couple of comments but am willing to accept this change as-is.

Comment thread CMakeLists.txt
triton-core-backendapi # from repo-core
triton-core-serverstub # from repo-core
triton-backend-utils # from repo-backend
${CMAKE_DL_LIBS} # dlopen/dlsym/dlclose for the MODEL_INIT_LIBRARY hook

Contributor

The CMAKE_ prefix is reserved for CMake's own variables, and I'm afraid its usage here may create some confusion. See the examples below.
http://cmake.org/cmake/help/latest/manual/cmake-variables.7.html

Also, the change description explains the following, which suggests those libraries are vital to preserving the functionality, doesn't it?

An HSTU AOTI package (model.pt2) calls nve_ops::embedding_lookup(keys, layer_id),
which looks the embedding table up by layer_id in a process-global
NVELayerRegistry. The embedding weights do not live inside model.pt2 —
they are separate files next to it (<model_dir>/metadata.json +
<model_dir>/weights/*.nve). So loading the .pt2 does not load them, and the
first inference request fails

Could you please share the origin of the missing libraries?

Contributor

@z52527 please follow up with @mc-nv's requests. This is blocking merge.

Author

Sorry for the confusion. The PR description was stale; it still described the earlier "link NVE into the backend" version. I've updated it.
In the current revision the backend links no NVE or vendor libraries, so there are no missing libraries to source. The only CMake change is target_link_libraries(... ${CMAKE_DL_LIBS}). CMAKE_DL_LIBS is a built-in CMake variable that names the platform's dynamic-loading library (libdl), which the hook needs for dlopen/dlsym/dlclose.

whoisj commented Jun 16, 2026

Contributor

CI Pipeline ID: 55028851

whoisj commented Jun 16, 2026

Contributor

I've pushed a change to make the pre-commit hook happy. @z52527, please pull your branch to receive my contribution before working on it to avoid conflicts; or duplicate my work in your own commit. Thanks.

3 participants