[Ministral-3-3B] Add VLM export, INT4 quantization, and evaluation pipeline#352
titaiwangms wants to merge 36 commits into
Pull request overview
Adds a new “builtin” export + inference recipe for mistralai/Ministral-3-3B-Instruct-2512, targeting ONNX Runtime GenAI by exporting the text decoder via Olive/ModelBuilder and the vision/embedding pieces via Mobius, plus generating the runtime genai_config.json/processor_config.json.
Changes:
- Introduces an end-to-end export/config-generation script (`optimize.py`) and a GenAI inference example (`inference.py`).
- Adds Olive configs for CPU/mobile (INT4) and CUDA (FP16), along with recipe metadata (`info.yml`) and docs (`README.md`).
- Adds custom patched modeling code under `codes/`, intended to support ONNX export.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| mistralai-Ministral-3-3B-Instruct-2512/builtin/user_script.py | Adds model config constants (currently with import-time HF loading). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/requirements.txt | Declares Olive + Mobius + torch/transformers dependencies. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md | Documents export workflow, output layout, and inference usage. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py | Implements export pipeline and GenAI config/tokenizer patching. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/info.yml | Registers builtin recipe metadata (keywords/EPs/devices/name). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/inference.py | Provides a CLI to run text-only and multimodal inference with ORT GenAI. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cuda/text.json | Olive ModelBuilder config for FP16 CUDA decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cpu_and_mobile/text.json | Olive ModelBuilder config for INT4 CPU/mobile decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/modeling_ministral3.py | Adds patched model components for ONNX-export-friendly behavior. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/__init__.py | Exposes Ministral3Model symbol. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/.gitignore | Ignores generated model artifacts and Olive cache. |
Complete Olive recipe for Ministral-3-3B-Instruct-2512 VLM using:
- Text decoder: Olive/ModelBuilder (INT4 for both CPU and CUDA)
- Vision encoder + embedding: Mobius (dynamo-free ONNX construction)
- Vision INT4 quantization: Olive post-export (CPU only)
- context_length=32768, Permute3D transform in processor_config
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add _strip_unused_initializers to reduce INT4 model size (1.7GB→220MB)
- Add _fix_gather_block_quantized for RoPE cache preservation
- CUDA: INT4 text + FP16 vision (71.65% AI2D)
- CPU: INT4 text + INT4 vision (69.07% AI2D)
- Remove unnecessary genai_config overrides (trust ModelBuilder)
- Add comprehensive README with benchmark results
- Fix eval.py build_messages for Jinja sort compatibility
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
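The initializer-stripping step mentioned in this commit can be sketched roughly as follows. This is not the recipe's `_strip_unused_initializers`; it is a minimal illustration that assumes the graph object exposes `.node`, `.initializer`, and `.output` the way an `onnx.GraphProto` does, and it uses a stand-in graph (with made-up tensor names) so it runs without the `onnx` package.

```python
from types import SimpleNamespace


def strip_unused_initializers(graph) -> int:
    """Remove initializers that no node input or graph output references.

    `graph` is anything shaped like an onnx.GraphProto (.node, .initializer,
    .output). Returns the number of initializers removed.
    Simplification: subgraphs nested in node attributes (If/Loop) are not scanned.
    """
    used = {name for node in graph.node for name in node.input if name}
    used.update(out.name for out in graph.output)
    unused = [init for init in graph.initializer if init.name not in used]
    for init in unused:
        graph.initializer.remove(init)
    return len(unused)


# Stand-in graph: one node reads only the packed weight, so the original
# float weight (hypothetical name "w_fp16") is dead and can be dropped.
graph = SimpleNamespace(
    node=[SimpleNamespace(input=["x", "w_q4", ""], output=["y"])],
    initializer=[SimpleNamespace(name="w_q4"), SimpleNamespace(name="w_fp16")],
    output=[SimpleNamespace(name="y")],
)
removed = strip_unused_initializers(graph)
print(removed, [init.name for init in graph.initializer])
```

With a real model, the same function would presumably be applied to `onnx.load(path).graph` before saving the model again; whether the recipe also rewrites the external data file is not stated here.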
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
- eval.py: Add explanatory comments to except-pass clauses
- optimize.py: Update docstring to match INT4 shipping config
- optimize.py: Document _get_hf_config MODEL_NAME usage
- optimize.py: Improve --dtype help text
- README.md: Fix precision labels (CUDA=INT4 text, CPU embedding=FP16)
- README.md: Remove stale FP32 embedding references
Note: eval.py dtype= kwarg is valid in transformers >=5.0
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation When --models-dir differs from the default (<config-dir>/models/), text.json output_dir is hardcoded so exports go to the default location. Copy the entire export tree to --models-dir after export so that update_genai_config() and fix_tokenizer() find the files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CUDA graph capture is unsupported for VLMs with dynamic image sizes. Set enable_cuda_graph=0 for ALL models (decoder, vision, embedding), matching the Qwen VLM recipe convention. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
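As a rough illustration of the change described above, the sketch below sets `enable_cuda_graph` to `"0"` for every component in a `genai_config.json`-style dictionary. The nesting used here (`model.<component>.session_options.provider_options`) is an assumption about the config layout, and `disable_cuda_graph` is a hypothetical helper, not a function from `optimize.py`.

```python
import json


def disable_cuda_graph(genai_config: dict) -> dict:
    """Set enable_cuda_graph="0" on every component that carries CUDA options."""
    for component in genai_config.get("model", {}).values():
        if not isinstance(component, dict):
            continue  # skip scalar fields such as "type"
        options = component.get("session_options", {}).get("provider_options", [])
        for entry in options:
            if "cuda" in entry:
                entry["cuda"]["enable_cuda_graph"] = "0"
    return genai_config


# Assumed layout, trimmed to the fields this sketch touches.
config = {
    "model": {
        "type": "hypothetical-vlm",
        "decoder": {"session_options": {"provider_options": [{"cuda": {"enable_cuda_graph": "1"}}]}},
        "vision": {"session_options": {"provider_options": [{"cuda": {}}]}},
        "embedding": {"session_options": {"provider_options": [{"cuda": {}}]}},
    }
}
disable_cuda_graph(config)
print(json.dumps(config["model"]["vision"], indent=2))
```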
Olive caches the full resolved config including absolute output_dir. On re-runs with different --models-dir, the stale cache writes to the old path, creating unexpected directories (e.g., ministral3-cpu-int4-test). Clear the cache before each quantization run. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- export_text_decoder: Load text.json as dict, override output_dir
- export_vision_and_embedding: Already accepts output_dir parameter
- quantize_vision_and_embedding: Load vision.json as dict, override model_path and output_dir
- Remove shutil.copytree post-export step from main()
- Remove .olive-cache clear (no longer needed)
- Pass models_dir through export_models() pipeline
This eliminates duplicate directories, copy overhead for multi-GB files, and ghost directories from stale Olive cache paths.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
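The "load as dict, override output_dir" approach might look something like the sketch below. It assumes `output_dir` is a top-level key of the Olive JSON config; `load_config_with_output_dir` is a hypothetical helper name, and the actual call into Olive is left out so the example stays runnable on its own.

```python
import json
import tempfile
from pathlib import Path


def load_config_with_output_dir(config_path, models_dir) -> dict:
    """Load an Olive JSON config and point its output_dir at models_dir."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    config["output_dir"] = str(models_dir)  # replaces the hardcoded value in memory only
    return config


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "text.json"
    config_path.write_text(json.dumps({"output_dir": "cpu_and_mobile/models"}), encoding="utf-8")
    patched = load_config_with_output_dir(config_path, Path(tmp) / "custom-models")
    on_disk = json.loads(config_path.read_text(encoding="utf-8"))

print(patched["output_dir"])
```

Because only the in-memory dictionary changes, the checked-in `text.json` keeps its default and no post-export copy is needed.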
- README.md: Update all vision INT4→INT8 references, benchmark table, model sizes - optimize.py: Update docstrings for INT8 vision, remove embedding.json references Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Pixtral vision encoder requires image_sizes metadata for variable-size multi-image processing. Without this step, the processor_config.json lacks the tensor needed by PixtralVisionState, causing multi-image inference to fail. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add note explaining that reported latency is per-sample end-to-end inference time (image preprocessing + tokenization + generation + decoding), excluding model loading. Update multi-image status from "not supported" to "supported" now that PixtralImageSizes is implemented. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove inaccurate max 8 tokens claim (only applies to PyTorch baseline) - Reconcile latency numbers between README and PR description (22.84s) - Move multi-image support from Known Limitations to Notes Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add webgpu config directory with INT4 block_size=32 and accuracy_level=4 for text decoder, INT8 block_size=32 for vision encoder. Both target WebGpuExecutionProvider. Add webgpu device option to optimize.py mapping to WebGpuExecutionProvider with empty provider options (no CUDA-specific flags needed). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
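A device-to-provider mapping of the kind described here could be sketched as below. The provider names come from this PR; the device keys (`cpu`, `gpu`, `webgpu`), the option values, and the `resolve_provider` helper are assumptions for illustration and may not match `optimize.py`.

```python
# Assumed device keys; provider names are the ones mentioned in this PR.
DEVICE_TO_EP = {
    "cpu": ("CPUExecutionProvider", {}),
    "gpu": ("CUDAExecutionProvider", {"enable_cuda_graph": "0"}),
    "webgpu": ("WebGpuExecutionProvider", {}),  # no CUDA-specific flags needed
}


def resolve_provider(device: str):
    """Return (execution provider name, provider options) for a device key."""
    try:
        name, options = DEVICE_TO_EP[device]
    except KeyError:
        known = ", ".join(sorted(DEVICE_TO_EP))
        raise ValueError(f"unknown device {device!r}; expected one of: {known}") from None
    return name, dict(options)  # copy so callers cannot mutate the table


print(resolve_provider("webgpu"))
```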
… vision/embedding
Replace the hand-rolled mobius.build() + pkg.save() export in optimize.py with a
proper Olive MobiusBuilder pass config, following the same pattern as text.json.
Changes:
- cuda/vision_embedding_export.json (new): MobiusBuilder pass with
precision=fp16, runtime=none, components_to_export=['vision_encoder','embedding'].
Outputs vision_encoder/model.onnx and embedding/model.onnx to models_dir.
- cuda/vision.json: update input model_path from cuda/models/vision/model.onnx
to cuda/models/vision_encoder/model.onnx (MobiusBuilder component sub-dir name).
- optimize.py:
- Replace export_vision_and_embedding() mobius implementation with Olive-based
version that runs vision_embedding_export.json via olive.run().
- quantize_vision_and_embedding(): source vision from vision_encoder/ (MobiusBuilder
output name) and write quantized output to vision/ (ort-genai expected name).
- _strip_unused_initializers(): now called on the quantized vision/ output, not
the unquantized vision_encoder/ source.
- export_models(): update docstring to reflect new output structure.
The --dtype arg is retained for backward compat but precision is now set in
vision_embedding_export.json (fp16). The --model-path arg continues to override
the HF model path in the Olive config.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- requirements.txt: advance mobius pin from @3777c18 to @41d2641 (3777c18 uses models['vision'] key; 65d0ea4 renamed it to 'vision_encoder' and 41d2641 is current HEAD)
- optimize.py module docstring: replace 'mobius direct API calls' language with 'Olive MobiusBuilder pass (vision_embedding_export.json)' to accurately reflect the current implementation
- optimize.py --dtype help text: correct the contradiction: dtype does not affect vision/embedding export (that precision is fixed in vision_embedding_export.json); the flag only controls text decoder quantization precision
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MobiusBuilder writes vision_encoder/ (~3.3 GB FP16) which is then quantized into vision/ (INT8). Once quantization completes, the intermediate vision_encoder/ directory is no longer needed. Cleanup is placed after the quantization loop so it only runs on success — any exception from Olive's run() or _strip_unused_initializers propagates before reaching the shutil.rmtree() call, preserving the intermediate files for debugging if quantization fails. The embedding/ directory (FP16 final output) is left untouched. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Olive's CompositeModelHandler writes flat ONNX files to output_dir:
models_dir/vision_encoder.onnx + vision_encoder.onnx.data
models_dir/embedding.onnx + embedding.onnx.data
But quantize_vision_and_embedding expects subdirectory layout:
models_dir/vision_encoder/model.onnx + {component}.onnx.data
Changes in export_vision_and_embedding():
- After Olive run, reorganize flat files into subdirectories
- Keep original data filename (e.g. vision_encoder.onnx.data) so the
relative external_data reference baked into model.onnx remains valid
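The flat-to-subdirectory reorganization can be sketched as follows. `move_into_subdir` is a hypothetical helper written for this illustration; it assumes the flat file names listed above and uses empty placeholder files instead of real ONNX models.

```python
import shutil
import tempfile
from pathlib import Path


def move_into_subdir(models_dir: Path, component: str) -> Path:
    """Move <component>.onnx to <component>/model.onnx; the data file keeps its name."""
    target = models_dir / component
    target.mkdir(parents=True, exist_ok=True)
    shutil.move(str(models_dir / f"{component}.onnx"), str(target / "model.onnx"))
    data = models_dir / f"{component}.onnx.data"
    if data.exists():
        # Same file name on purpose: model.onnx refers to its external data by relative name.
        shutil.move(str(data), str(target / data.name))
    return target / "model.onnx"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("vision_encoder.onnx", "vision_encoder.onnx.data"):
        (root / name).write_bytes(b"")  # placeholders, not real models
    move_into_subdir(root, "vision_encoder")
    layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())

print(layout)
```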
Changes in quantize_vision_and_embedding():
- Guard _strip_unused_initializers with output existence check — Olive
catches pass failures internally without raising, so without this guard
a silent failure produces a confusing FileNotFoundError on the next line
- Cleanup of vision_encoder/ is now conditional on vision/model.onnx
existing (i.e. quantization actually succeeded) to preserve intermediate
files for debugging on failure
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
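The existence guard and conditional cleanup from this commit might be sketched like this. `finalize_quantization` is a hypothetical name, and the sketch only covers the directory handling; the real function also strips unused initializers before cleaning up.

```python
import shutil
import tempfile
from pathlib import Path


def finalize_quantization(models_dir: Path) -> bool:
    """Remove the intermediate vision_encoder/ only if vision/model.onnx exists."""
    quantized = models_dir / "vision" / "model.onnx"
    if not quantized.exists():
        # The pass failed without raising; keep the intermediates for debugging.
        print(f"warning: {quantized} is missing, keeping vision_encoder/")
        return False
    shutil.rmtree(models_dir / "vision_encoder", ignore_errors=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "vision_encoder").mkdir()

    failed = finalize_quantization(root)  # no quantized output yet
    kept_after_failure = (root / "vision_encoder").exists()

    (root / "vision").mkdir()
    (root / "vision" / "model.onnx").write_bytes(b"")  # placeholder output
    succeeded = finalize_quantization(root)
    removed_after_success = not (root / "vision_encoder").exists()

print(failed, kept_after_failure, succeeded, removed_after_success)
```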
… configs
Mirror the cuda MobiusBuilder-based vision+embedding export to the other targets:
webgpu/:
- Add vision_embedding_export.json: MobiusBuilder fp16, WebGpuExecutionProvider,
components_to_export=['vision_encoder','embedding']
- Update vision.json input_model path: vision/ → vision_encoder/ (MobiusBuilder output)
cpu_and_mobile/:
- Add vision_embedding_export.json: MobiusBuilder fp32 (safer for CPU/mobile),
no explicit accelerator (CPU default), components_to_export=['vision_encoder','embedding']
- Update vision.json input_model path: vision/ → vision_encoder/ (MobiusBuilder output)
optimize.py:
- Update export_vision_and_embedding() docstring: remove cuda/ hardcoding,
note fp16 for cuda/webgpu, fp32 for cpu_and_mobile
- Update module and export_models() docstrings similarly
No logic changes needed in optimize.py — it already uses config_dir generically.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Vision/embedding precision depends on target config:
- cuda/webgpu: FP16
- cpu_and_mobile: FP32
Updated all affected locations:
- export_models() docstring: vision_encoder/ and embedding/ entries
- quantize_vision_and_embedding() docstring: 'Embedding stays' line
- export_models() inline comments (FP16 → FP16 for cuda/webgpu, FP32 for cpu_and_mobile)
- --dtype CLI help text: replaced 'those are always FP16' with accurate per-target description
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
olive-ai
ep:
CPUExecutionProvider
CUDAExecutionProvider
device:
cpu
gpu
Fixed in commit ea2a218. Rewrote info.yml to match the builtin recipe schema used by other recipes (e.g., Qwen2.5-VL, Qwen3.5): proper YAML list markers for keywords, eps, devices, and added the required recipes: key with name and file. Validated with yaml.safe_load.
**CPU (k_quant_mixed INT4 text + INT8 vision + FP32 embedding):**

```bash
python optimize.py --config-dir cpu_and_mobile --device cpu --dtype f32
```
Fixed in commit ea2a218. Removed --dtype f32 from the CPU command example (the flag is a no-op for the Olive-managed export). Updated the Notes section to explicitly state that --dtype is accepted for backward compatibility but does not control export precision — precision is set in the JSON config files (vision_embedding_export.json).
Note: precision for vision/embedding export is set in vision_embedding_export.json
(fp16 for cuda/webgpu, fp32 for cpu_and_mobile). The --dtype CLI arg is accepted for backward compatibility but
does not affect the Olive-managed export.
"""
if models_dir is None:
    models_dir = str(Path(config_dir) / MODELS_DIR)
Fixed in commit ea2a218. Added a DeprecationWarning in export_models() when a non-default --dtype is passed, pointing users to vision_embedding_export.json as the correct place to control precision. Updated the docstring to explicitly call out that the parameter does not affect the Olive-managed export. The parameter is kept for backward compatibility with any existing scripts that pass --dtype.
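A warning of this kind could be emitted roughly as shown below. The default value `"f16"` and the helper name `warn_if_dtype_overridden` are assumptions for the sketch; only the idea (warn on a non-default `--dtype` and point to `vision_embedding_export.json`) comes from the reply above.

```python
import warnings

DEFAULT_DTYPE = "f16"  # assumed default; the real default is not shown in this thread


def warn_if_dtype_overridden(dtype: str) -> None:
    """Warn when a non-default dtype is passed, since it no longer sets export precision."""
    if dtype != DEFAULT_DTYPE:
        warnings.warn(
            "--dtype does not control export precision; "
            "set it in vision_embedding_export.json instead.",
            DeprecationWarning,
            stacklevel=2,
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # DeprecationWarning is hidden by default in some contexts
    warn_if_dtype_overridden("f32")          # warns
    warn_if_dtype_overridden(DEFAULT_DTYPE)  # silent

print([str(w.message) for w in caught])
```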
if (i + 1) % 10 == 0 or i == 0:
    print(
        f"  [{i + 1:4d}/{total}] gt={gt} pred={pred} raw={raw.strip()!r:20} "
        f"{'✓' if hit else '✗'} running_acc={correct / (i + 1 - skipped + 1e-9):.3f}"
    )

if hit:
    correct += 1
Fixed in commit ea2a218. Changed correct to correct + (1 if hit else 0) in the running_acc print so the displayed accuracy reflects the current sample's result before correct is incremented.
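The off-by-one is easiest to see with a tiny stand-alone loop. The sketch below reproduces both formulas from the snippet and the reply, with `skipped` fixed at 0 and a made-up list of hits; it is an illustration, not the code in `eval.py`.

```python
def running_accuracies(hits, include_current=True):
    """Return the accuracy that would be printed at each step.

    include_current=False mimics the original print, which ran before
    `correct` was incremented for the current sample.
    """
    correct = 0
    printed = []
    for i, hit in enumerate(hits):
        numerator = correct + (1 if hit and include_current else 0)
        printed.append(numerator / (i + 1 + 1e-9))
        if hit:
            correct += 1
    return printed


hits = [True, False, True]  # made-up results for three samples
before_fix = [f"{acc:.3f}" for acc in running_accuracies(hits, include_current=False)]
after_fix = [f"{acc:.3f}" for acc in running_accuracies(hits)]
print(before_fix)
print(after_fix)
```

With the original formula the first sample prints an accuracy of zero even when it is a hit, because the current result has not been counted yet.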
Review (synthesized from 4 reviewers: readability / code / critical / deep)
Convergent findings flagged with [N reviewers]. Severity follows worst-case among reviewers.
Major
1.
python optimize.py --config-dir cpu_and_mobile --device cpu --dtype f32 # f32 is a no-op
Fix: either wire
2. Always loads from the hardcoded
3.
4.
5. The same artifact is referenced by three different names:
_MOBIUS_VISION_COMPONENT = "vision_encoder"
_GENAI_VISION_COMPONENT = "vision"
…and a comment on the pivot line cross-referencing them.
6. Recipe depends on two unmerged Olive PRs without a version pin [critical]
7.
Mistral3ForConditionalGeneration.from_pretrained(model_id, ..., trust_remote_code=True)
AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
Crosses a network trust boundary: passing a malicious HF repo executes arbitrary code during evaluation. Fix: default
Minor
Praise
Recommended block-list before merge
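The suggested fix for finding 7 is cut off in this capture. One plausible reading, assumed here and not confirmed by the review text, is to make remote code execution opt-in through a CLI flag. The `--trust-remote-code` and `--model-id` flags and the `build_parser` helper below are hypothetical, not options that `eval.py` is known to expose.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical eval CLI where remote code execution must be requested explicitly."""
    parser = argparse.ArgumentParser(description="Sketch of an opt-in trust flag")
    parser.add_argument("--model-id", default="mistralai/Ministral-3-3B-Instruct-2512")
    parser.add_argument(
        "--trust-remote-code",
        action="store_true",  # defaults to False when the flag is absent
        help="Allow the model repository to run its own Python code.",
    )
    return parser


default_args = build_parser().parse_args([])
opted_in = build_parser().parse_args(["--trust-remote-code"])
print(default_args.trust_remote_code, opted_in.trust_remote_code)
```

The parsed value would then be forwarded as `trust_remote_code=args.trust_remote_code` to the `from_pretrained` calls quoted in the finding.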
- info.yml: fix invalid YAML (list markers, recipes key) to pass catalog validation
- eval.py: fix running_acc off-by-one (include current hit before incrementing correct)
- README: remove misleading --dtype f32 from CPU command; clarify --dtype is deprecated
- optimize.py: add DeprecationWarning when non-default --dtype is passed; update docstring
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ull dir structure
- Fix CUDA model size: 3.83G → 3.6G (benchmark table and prose)
- Replace all direct 'Mobius'/'mobius' export references with 'Olive/MobiusBuilder pass'
- Update directory structure to show vision_embedding_export.json in all 3 targets (cuda, webgpu, cpu_and_mobile) and add webgpu target
- Update output structure: vision shows INT8 (post-quant), embedding shows FP16/FP32
- Update CPU/CUDA pipeline notes to CUDA/WebGPU vs CPU framing
- Update export_vision_and_embedding() code section description to reflect flat→subdir reorganization
- Top-level README is auto-generated from info.yml; no manual update needed
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Adds a complete Olive recipe for exporting Ministral-3-3B-Instruct-2512 (Pixtral) as a 3-model VLM pipeline for ONNX Runtime GenAI:
Export Pipeline
Vision encoder and embedding are exported using the Olive MobiusBuilder pass (not direct `mobius.build()` API calls). This gives full Olive pass lifecycle management, caching, and config-driven parameterization.
Each target has its own `vision_embedding_export.json` config:
- `cuda/`: `cuda/vision_embedding_export.json`
- `webgpu/`: `webgpu/vision_embedding_export.json`
- `cpu_and_mobile/`: `cpu_and_mobile/vision_embedding_export.json`
MobiusBuilder exports `components_to_export=["vision_encoder", "embedding"]`, writing:
- `<target>/models/vision_encoder/model.onnx`: input to INT8 RTN quantization
- `<target>/models/embedding/model.onnx`: kept as-is (no quantization)
After quantization, the intermediate `vision_encoder/` directory is cleaned up and the quantized output lands in `vision/` (the name ort-genai expects).
`optimize.py` is fully device-agnostic: the same code handles all three targets via `--config-dir`.
Recipe Configuration Detail
- `{target}/text.json`
- `{target}/vision_embedding_export.json` (`runtime=none`)
- `{target}/vision.json` (`vision_encoder/`)
AI2D Benchmark Results
Accuracy Gap
Model Size Breakdown
CUDA (3.6 GB)
CPU (4.92 GB)
Key Implementation Notes
- torch.onnx.export/dynamo failures. MobiusBuilder constructs graphs declaratively and produces already-optimized ONNX (fused MHA, SkipLayerNorm).
- `runtime="none"` on MobiusBuilder prevents writing `genai_config.json`/tokenizer files (those come from ModelBuilder text export).
- `components_to_export` requires microsoft/Olive PR #2456 (MobiusBuilder filter, submitted upstream).
- `GatherBlockQuantized` accuracy bug.
- `vision_encoder/` to `vision/` rename: MobiusBuilder uses the mobius package key name (`vision_encoder`); ort-genai expects `vision/`. Bridged in `quantize_vision_and_embedding()`.
- `_strip_unused_initializers()` removes original FP32/FP16 weights (~87% reduction, e.g. 1.7 GB to 220 MB for vision).
Recent Fixes (commit ea2a218)
- `info.yml`: fixed invalid YAML (list markers, `recipes` key) to pass catalog validation
- `eval.py`: fixed `running_acc` off-by-one (include current hit before incrementing `correct`)
- `README`: removed misleading `--dtype f32` from CPU command; clarified `--dtype` is deprecated
- `optimize.py`: added `DeprecationWarning` when non-default `--dtype` is passed; updated docstring
Model Verification
`model-mm.py` was run against both the old and new MobiusBuilder-based exports. Generated text output was identical in quality across all 3 test images (challenge.jpg, fish.jpg, image2.jpg), confirming the new recipe produces equivalent results.
Dependencies
- `components_to_export` filter (submitted, pending merge)
- `components_to_skip` (submitted, pending merge)