[Ministral-3-3B] Add VLM export, INT4 quantization, and evaluation pipeline#352

Open: titaiwangms wants to merge 36 commits into main from ministral-3b-text-export

@titaiwangms titaiwangms commented Apr 8, 2026

Summary

Adds a complete Olive recipe for exporting Ministral-3-3B-Instruct-2512 (Pixtral) as a 3-model VLM pipeline for ONNX Runtime GenAI:

  • Text decoder — Olive/ModelBuilder (GQA attention, YaRN RoPE, k_quant_mixed INT4)
  • Vision encoder — Olive/MobiusBuilder declarative export (Pixtral, dynamic H×W, 2D RoPE) + INT8 RTN quantization
  • Embedding — Olive/MobiusBuilder export (token + image fusion, FP16/FP32 by target, no quantization)

Export Pipeline

Vision encoder and embedding are exported using the Olive MobiusBuilder pass (not direct mobius.build() API calls). This gives full Olive pass lifecycle management, caching, and config-driven parameterization.

Each target has its own vision_embedding_export.json config:

| Target | Config | Precision | Accelerator |
| --- | --- | --- | --- |
| cuda/ | cuda/vision_embedding_export.json | FP16 | CUDAExecutionProvider |
| webgpu/ | webgpu/vision_embedding_export.json | FP16 | WebGpuExecutionProvider |
| cpu_and_mobile/ | cpu_and_mobile/vision_embedding_export.json | FP32 | CPU (default) |

MobiusBuilder exports components_to_export=["vision_encoder", "embedding"], writing:

  • <target>/models/vision_encoder/model.onnx — input to INT8 RTN quantization
  • <target>/models/embedding/model.onnx — kept as-is (no quantization)

After quantization, the intermediate vision_encoder/ directory is cleaned up and the quantized output lands in vision/ (the name ort-genai expects). optimize.py is fully device-agnostic — the same code handles all three targets via --config-dir.
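As a rough illustration of the per-target layout described above, the sketch below maps a config directory to its export config, precision, and expected model paths. `TARGETS` and `export_layout` are hypothetical names used only for this example and are not taken from optimize.py; the paths and precisions come from the table and bullets above.

```python
from pathlib import Path

# Precision per target, as listed in the table above.
TARGETS = {"cuda": "fp16", "webgpu": "fp16", "cpu_and_mobile": "fp32"}


def export_layout(config_dir: str) -> dict:
    """Return the config path and expected model paths for one target (illustrative only)."""
    root = Path(config_dir)
    models = root / "models"
    return {
        "config": root / "vision_embedding_export.json",
        "precision": TARGETS[root.name],
        # MobiusBuilder output, consumed by INT8 RTN quantization and then removed.
        "vision_intermediate": models / "vision_encoder" / "model.onnx",
        # Quantized output under the directory name ort-genai expects.
        "vision_final": models / "vision" / "model.onnx",
        # Kept as exported, no quantization.
        "embedding": models / "embedding" / "model.onnx",
    }


layout = export_layout("cuda")
print(layout["config"])
```

Keeping the mapping keyed on the directory name mirrors the device-agnostic design: only `--config-dir` changes between targets.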

Recipe Configuration Detail

| Step | File | Pass | Notes |
| --- | --- | --- | --- |
| Text export+quant | {target}/text.json | ModelBuilder | k_quant_mixed INT4, generates genai_config |
| Vision+embedding export | {target}/vision_embedding_export.json | MobiusBuilder | FP16 or FP32, runtime=none |
| Vision quantization | {target}/vision.json | OnnxBlockWiseRtnQuantization | INT8, reads from vision_encoder/ |

AI2D Benchmark Results

| Config | Samples | Accuracy | Latency (s/sample) | TTFT | Model Size |
| --- | --- | --- | --- | --- | --- |
| PyTorch CPU FP32 | 100 | 78.00% | 22.84s | | |
| PyTorch CUDA FP16 | 500 | 76.40% | 0.50s | | |
| ONNX CPU INT4 | 100 | 77.00% | 5.85s | 0.5ms | 4.92 GB |
| ONNX CUDA INT4 | 500 | 74.00% | 0.14s | 1.5ms | 3.6 GB |

Latency Measurement: Per-sample end-to-end inference time (image in → text out). Excludes model loading. Measured with time.perf_counter() averaged over all samples.
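The measurement described above can be sketched as follows. This is a minimal illustration, not the code in eval.py: `mean_latency` and the stand-in workload are hypothetical, and the model is assumed to be loaded before the timer starts.

```python
import time
from typing import Callable, Iterable


def mean_latency(run_sample: Callable[[object], object], samples: Iterable[object]) -> float:
    """Average per-sample wall-clock time in seconds (model loading is not timed)."""
    durations = []
    for sample in samples:
        start = time.perf_counter()
        run_sample(sample)  # one full inference: image in, text out
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("no samples to time")
    return sum(durations) / len(durations)


# Stand-in workload so the sketch runs without a model.
latency = mean_latency(lambda n: sum(range(n)), [1000, 2000, 3000])
print(f"{latency:.6f} s/sample")
```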

Accuracy Gap

| Comparison | Gap |
| --- | --- |
| ONNX CUDA vs PyTorch CUDA | 2.4pp |
| ONNX CPU vs PyTorch CPU | 1.0pp |

Note: PyTorch baseline includes FP8 to BF16 dequantization (Ministral-3-3B ships FP8-only).
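For reference, the gaps are plain differences in percentage points between the accuracy values in the benchmark table. The `gap_pp` helper below is a hypothetical name used only for this check.

```python
# Accuracy values (percent) copied from the AI2D benchmark table.
accuracy = {
    "pytorch_cpu_fp32": 78.00,
    "pytorch_cuda_fp16": 76.40,
    "onnx_cpu_int4": 77.00,
    "onnx_cuda_int4": 74.00,
}


def gap_pp(baseline: str, candidate: str) -> float:
    """Accuracy drop from baseline to candidate, in percentage points."""
    return round(accuracy[baseline] - accuracy[candidate], 1)


print(gap_pp("pytorch_cuda_fp16", "onnx_cuda_int4"))  # 2.4
print(gap_pp("pytorch_cpu_fp32", "onnx_cpu_int4"))    # 1.0
```

Both sides of each comparison use the same sample count in the table (500 for CUDA, 100 for CPU).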

Model Size Breakdown

CUDA (3.6 GB)

| Component | Quantization | Size |
| --- | --- | --- |
| Text decoder | k_quant_mixed INT4 | 2.4G |
| Vision encoder | INT8 RTN | 409M |
| Embedding | FP16 | 769M |

CPU (4.92 GB)

| Component | Quantization | Size |
| --- | --- | --- |
| Text decoder | k_quant_mixed INT4 | 2.7G |
| Vision encoder | INT8 RTN | 419M |
| Embedding | FP32 | 1.6G |

The CPU model is larger because the CPU EP uses FP32 for non-quantized weights (the embedding and sensitive decoder layers).

Key Implementation Notes

  • MobiusBuilder used because Pixtral dynamic image dimensions cause torch.onnx.export/dynamo failures. MobiusBuilder constructs graphs declaratively and produces already-optimized ONNX (fused MHA, SkipLayerNorm).
  • runtime="none" on MobiusBuilder prevents writing genai_config.json/tokenizer files (those come from ModelBuilder text export).
  • components_to_export requires microsoft/Olive PR #2456 (MobiusBuilder filter, submitted upstream).
  • Embedding not quantized: avoids a GatherBlockQuantized accuracy bug.
  • vision_encoder/ to vision/ rename: MobiusBuilder uses the mobius package key name (vision_encoder); ort-genai expects vision/. Bridged in quantize_vision_and_embedding().
  • Unused initializer stripping: after INT8 quantization, _strip_unused_initializers() removes original FP32/FP16 weights (~87% reduction, e.g. 1.7 GB to 220 MB for vision).

Recent Fixes (commit ea2a218)

  • info.yml: fixed invalid YAML (list markers, recipes key) to pass catalog validation
  • eval.py: fixed running_acc off-by-one (include current hit before incrementing correct)
  • README: removed misleading --dtype f32 from CPU command; clarified --dtype is deprecated
  • optimize.py: added DeprecationWarning when non-default --dtype is passed; updated docstring

Model Verification

model-mm.py was run against both the old and new MobiusBuilder-based exports. Generated text output was identical in quality across all 3 test images (challenge.jpg, fish.jpg, image2.jpg), confirming the new recipe produces equivalent results.

Dependencies

  • onnxruntime-genai PR #2077 — Mistral3/Pixtral VLM runtime support (image processor, PixtralVisionState, multi-image)
  • onnxruntime-genai PR #2076 — YaRN RoPE parity fixes (merged)
  • onnxruntime-extensions PR #1050 — PixtralImageSizes custom op (merged)
  • microsoft/Olive PR #2456 — MobiusBuilder components_to_export filter (submitted, pending merge)
  • microsoft/Olive PR #2457 — OnnxBlockWiseRtnQuantization components_to_skip (submitted, pending merge)

Copilot AI review requested due to automatic review settings April 8, 2026 23:15
@titaiwangms titaiwangms marked this pull request as draft April 8, 2026 23:17
Contributor

Copilot AI left a comment

Pull request overview

Adds a new “builtin” export + inference recipe for mistralai/Ministral-3-3B-Instruct-2512, targeting ONNX Runtime GenAI by exporting the text decoder via Olive/ModelBuilder and the vision/embedding pieces via Mobius, plus generating the runtime genai_config.json/processor_config.json.

Changes:

  • Introduces an end-to-end export/config-generation script (optimize.py) and a GenAI inference example (inference.py).
  • Adds Olive configs for CPU/mobile (INT4) and CUDA (FP16), along with recipe metadata (info.yml) and docs (README.md).
  • Adds custom patched modeling code under codes/ intended to support ONNX export.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/user_script.py | Adds model config constants (currently with import-time HF loading). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/requirements.txt | Declares Olive + Mobius + torch/transformers dependencies. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md | Documents export workflow, output layout, and inference usage. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py | Implements export pipeline and GenAI config/tokenizer patching. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/info.yml | Registers builtin recipe metadata (keywords/EPs/devices/name). |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/inference.py | Provides a CLI to run text-only and multimodal inference with ORT GenAI. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cuda/text.json | Olive ModelBuilder config for FP16 CUDA decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/cpu_and_mobile/text.json | Olive ModelBuilder config for INT4 CPU/mobile decoder export. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/modeling_ministral3.py | Adds patched model components for ONNX-export-friendly behavior. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/codes/__init__.py | Exposes Ministral3Model symbol. |
| mistralai-Ministral-3-3B-Instruct-2512/builtin/.gitignore | Ignores generated model artifacts and Olive cache. |


Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/user_script.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/requirements.txt Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/inference.py
@titaiwangms titaiwangms force-pushed the ministral-3b-text-export branch 2 times, most recently from 8058122 to 9d5c64b Compare April 9, 2026 21:31
@titaiwangms titaiwangms changed the title Add Ministral-3-3B-Instruct-2512 recipe Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export Apr 9, 2026
@titaiwangms titaiwangms force-pushed the ministral-3b-text-export branch 2 times, most recently from b3f8592 to 5969770 Compare April 10, 2026 00:04
@titaiwangms titaiwangms marked this pull request as ready for review April 10, 2026 21:58
@titaiwangms titaiwangms force-pushed the ministral-3b-text-export branch from d3f7f6a to 5eb675d Compare April 14, 2026 20:23
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
@titaiwangms titaiwangms force-pushed the ministral-3b-text-export branch 5 times, most recently from 9d5b928 to 7a914be Compare April 14, 2026 21:52
Complete olive recipe for Ministral-3-3B-Instruct-2512 VLM using:
- Text decoder: Olive/ModelBuilder (INT4 for both CPU and CUDA)
- Vision encoder + embedding: Mobius (dynamo-free ONNX construction)
- Vision INT4 quantization: Olive post-export (CPU only)
- context_length=32768, Permute3D transform in processor_config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms force-pushed the ministral-3b-text-export branch from 7a914be to 1bdc231 Compare April 14, 2026 22:08
- Add _strip_unused_initializers to reduce INT4 model size (1.7GB→220MB)
- Add _fix_gather_block_quantized for RoPE cache preservation
- CUDA: INT4 text + FP16 vision (71.65% AI2D)
- CPU: INT4 text + INT4 vision (69.07% AI2D)
- Remove unnecessary genai_config overrides (trust ModelBuilder)
- Add comprehensive README with benchmark results
- Fix eval.py build_messages for Jinja sort compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py Fixed
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py Fixed
@titaiwangms titaiwangms changed the title Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export feat: Add Ministral-3-3B VLM recipe with INT4 quantization and eval benchmarks Apr 15, 2026
@titaiwangms titaiwangms requested a review from Copilot April 15, 2026 22:22
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.



Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/optimize.py
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/eval.py Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md Outdated
Comment thread mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md Outdated
titaiwangms and others added 5 commits April 15, 2026 22:39
- eval.py: Add explanatory comments to except-pass clauses
- optimize.py: Update docstring to match INT4 shipping config
- optimize.py: Document _get_hf_config MODEL_NAME usage
- optimize.py: Improve --dtype help text
- README.md: Fix precision labels (CUDA=INT4 text, CPU embedding=FP16)
- README.md: Remove stale FP32 embedding references

Note: eval.py dtype= kwarg is valid in transformers >=5.0

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ation

When --models-dir differs from the default (<config-dir>/models/),
text.json output_dir is hardcoded so exports go to the default location.
Copy the entire export tree to --models-dir after export so that
update_genai_config() and fix_tokenizer() find the files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CUDA graph capture is unsupported for VLMs with dynamic image sizes.
Set enable_cuda_graph=0 for ALL models (decoder, vision, embedding),
matching the Qwen VLM recipe convention.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Olive caches the full resolved config including absolute output_dir.
On re-runs with different --models-dir, the stale cache writes to the
old path, creating unexpected directories (e.g., ministral3-cpu-int4-test).
Clear the cache before each quantization run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- export_text_decoder: Load text.json as dict, override output_dir
- export_vision_and_embedding: Already accepts output_dir parameter
- quantize_vision_and_embedding: Load vision.json as dict, override
  model_path and output_dir
- Remove shutil.copytree post-export step from main()
- Remove .olive-cache clear (no longer needed)
- Pass models_dir through export_models() pipeline

This eliminates duplicate directories, copy overhead for multi-GB files,
and ghost directories from stale Olive cache paths.
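The override described in this commit can be sketched as below: load the JSON config as a dict, replace `output_dir`, and hand the dict to Olive instead of the file path. The helper name and the assumption that `output_dir` is a top-level key are illustrative only; the real layout of text.json may differ.

```python
import json
import tempfile
from pathlib import Path


def load_config_with_output_dir(config_path: str, models_dir: str) -> dict:
    """Load an Olive JSON config as a dict and point output_dir at models_dir."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    config["output_dir"] = str(models_dir)  # assumption: output_dir is a top-level key
    return config


# Demo with a throwaway config file; no Olive call is made here.
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "text.json"
    cfg_path.write_text(json.dumps({"output_dir": "cpu_and_mobile/models"}), encoding="utf-8")
    config = load_config_with_output_dir(str(cfg_path), "/data/ministral-out")

print(config["output_dir"])
```

Because the override happens in memory, the checked-in JSON stays untouched and no post-export copy is needed.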

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms and others added 9 commits April 21, 2026 20:07
- README.md: Update all vision INT4→INT8 references, benchmark table, model sizes
- optimize.py: Update docstrings for INT8 vision, remove embedding.json references

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Pixtral vision encoder requires image_sizes metadata for variable-size
multi-image processing. Without this step, the processor_config.json lacks
the tensor needed by PixtralVisionState, causing multi-image inference to fail.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add note explaining that reported latency is per-sample end-to-end inference
time (image preprocessing + tokenization + generation + decoding), excluding
model loading. Update multi-image status from "not supported" to "supported"
now that PixtralImageSizes is implemented.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove inaccurate max 8 tokens claim (only applies to PyTorch baseline)
- Reconcile latency numbers between README and PR description (22.84s)
- Move multi-image support from Known Limitations to Notes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add webgpu config directory with INT4 block_size=32 and accuracy_level=4
for text decoder, INT8 block_size=32 for vision encoder. Both target
WebGpuExecutionProvider.

Add webgpu device option to optimize.py mapping to WebGpuExecutionProvider
with empty provider options (no CUDA-specific flags needed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@titaiwangms titaiwangms requested a review from xiaoyu-work April 29, 2026 05:14
@titaiwangms titaiwangms enabled auto-merge (squash) April 30, 2026 22:30
titaiwangms and others added 8 commits May 4, 2026 10:41
… vision/embedding

Replace the hand-rolled mobius.build() + pkg.save() export in optimize.py with a
proper Olive MobiusBuilder pass config, following the same pattern as text.json.

Changes:
- cuda/vision_embedding_export.json (new): MobiusBuilder pass with
  precision=fp16, runtime=none, components_to_export=['vision_encoder','embedding'].
  Outputs vision_encoder/model.onnx and embedding/model.onnx to models_dir.

- cuda/vision.json: update input model_path from cuda/models/vision/model.onnx
  to cuda/models/vision_encoder/model.onnx (MobiusBuilder component sub-dir name).

- optimize.py:
  - Replace export_vision_and_embedding() mobius implementation with Olive-based
    version that runs vision_embedding_export.json via olive.run().
  - quantize_vision_and_embedding(): source vision from vision_encoder/ (MobiusBuilder
    output name) and write quantized output to vision/ (ort-genai expected name).
  - _strip_unused_initializers(): now called on the quantized vision/ output, not
    the unquantized vision_encoder/ source.
  - export_models(): update docstring to reflect new output structure.

The --dtype arg is retained for backward compat but precision is now set in
vision_embedding_export.json (fp16). The --model-path arg continues to override
the HF model path in the Olive config.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- requirements.txt: advance mobius pin from @3777c18 to @41d2641
  (3777c18 uses models['vision'] key; 65d0ea4 renamed it to
  'vision_encoder' and 41d2641 is current HEAD)

- optimize.py module docstring: replace 'mobius direct API calls'
  language with 'Olive MobiusBuilder pass (vision_embedding_export.json)'
  to accurately reflect the current implementation

- optimize.py --dtype help text: correct the contradiction — dtype
  does not affect vision/embedding export (that precision is fixed
  in vision_embedding_export.json); the flag only controls text
  decoder quantization precision

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MobiusBuilder writes vision_encoder/ (~3.3 GB FP16) which is then
quantized into vision/ (INT8). Once quantization completes, the
intermediate vision_encoder/ directory is no longer needed.

Cleanup is placed after the quantization loop so it only runs on
success — any exception from Olive's run() or _strip_unused_initializers
propagates before reaching the shutil.rmtree() call, preserving the
intermediate files for debugging if quantization fails.

The embedding/ directory (FP16 final output) is left untouched.
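The ordering argument in this commit message can be sketched as follows. `quantize_then_cleanup` and `failing_quantize` are hypothetical stand-ins, not the functions in optimize.py; the point is only that the `shutil.rmtree()` call is unreachable when the preceding step raises.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def quantize_then_cleanup(models_dir: Path, quantize: Callable[[Path], None]) -> None:
    """Run quantization, then delete the intermediate export only if nothing raised."""
    quantize(models_dir)  # an exception here skips the rmtree below
    shutil.rmtree(models_dir / "vision_encoder")


def failing_quantize(models_dir: Path) -> None:
    raise RuntimeError("simulated quantization failure")


with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp)
    (models / "vision_encoder").mkdir()
    try:
        quantize_then_cleanup(models, failing_quantize)
    except RuntimeError:
        pass
    preserved = (models / "vision_encoder").is_dir()  # kept for debugging

    quantize_then_cleanup(models, lambda d: None)  # success path
    removed = not (models / "vision_encoder").exists()

print(preserved, removed)
```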

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Olive's CompositeModelHandler writes flat ONNX files to output_dir:
  models_dir/vision_encoder.onnx + vision_encoder.onnx.data
  models_dir/embedding.onnx      + embedding.onnx.data

But quantize_vision_and_embedding expects subdirectory layout:
  models_dir/vision_encoder/model.onnx + {component}.onnx.data

Changes in export_vision_and_embedding():
- After Olive run, reorganize flat files into subdirectories
- Keep original data filename (e.g. vision_encoder.onnx.data) so the
  relative external_data reference baked into model.onnx remains valid

Changes in quantize_vision_and_embedding():
- Guard _strip_unused_initializers with output existence check — Olive
  catches pass failures internally without raising, so without this guard
  a silent failure produces a confusing FileNotFoundError on the next line
- Cleanup of vision_encoder/ is now conditional on vision/model.onnx
  existing (i.e. quantization actually succeeded) to preserve intermediate
  files for debugging on failure
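A minimal sketch of the flat-to-subdirectory reorganization described above, assuming the file names listed in this commit message. `reorganize_component` is a hypothetical helper, and the dummy files stand in for real ONNX artifacts.

```python
import shutil
import tempfile
from pathlib import Path


def reorganize_component(models_dir: Path, component: str) -> Path:
    """Move <component>.onnx to <component>/model.onnx, keeping the data file's name."""
    flat_model = models_dir / f"{component}.onnx"
    flat_data = models_dir / f"{component}.onnx.data"
    target_dir = models_dir / component
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(flat_model), str(target_dir / "model.onnx"))
    if flat_data.exists():
        # Same file name, so the relative external-data reference in the model still resolves.
        shutil.move(str(flat_data), str(target_dir / flat_data.name))
    return target_dir / "model.onnx"


with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp)
    (models / "vision_encoder.onnx").write_bytes(b"graph")
    (models / "vision_encoder.onnx.data").write_bytes(b"weights")
    reorganize_component(models, "vision_encoder")
    layout = sorted(p.relative_to(models).as_posix() for p in models.rglob("*") if p.is_file())

print(layout)
```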

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… configs

Mirror the cuda MobiusBuilder-based vision+embedding export to the other targets:

webgpu/:
  - Add vision_embedding_export.json: MobiusBuilder fp16, WebGpuExecutionProvider,
    components_to_export=['vision_encoder','embedding']
  - Update vision.json input_model path: vision/ → vision_encoder/ (MobiusBuilder output)

cpu_and_mobile/:
  - Add vision_embedding_export.json: MobiusBuilder fp32 (safer for CPU/mobile),
    no explicit accelerator (CPU default), components_to_export=['vision_encoder','embedding']
  - Update vision.json input_model path: vision/ → vision_encoder/ (MobiusBuilder output)

optimize.py:
  - Update export_vision_and_embedding() docstring: remove cuda/ hardcoding,
    note fp16 for cuda/webgpu, fp32 for cpu_and_mobile
  - Update module and export_models() docstrings similarly

No logic changes needed in optimize.py — it already uses config_dir generically.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Vision/embedding precision depends on target config:
  - cuda/webgpu:     FP16
  - cpu_and_mobile:  FP32

Updated all affected locations:
  - export_models() docstring: vision_encoder/ and embedding/ entries
  - quantize_vision_and_embedding() docstring: 'Embedding stays' line
  - export_models() inline comments (FP16 → FP16 for cuda/webgpu, FP32 for cpu_and_mobile)
  - --dtype CLI help text: replaced 'those are always FP16' with accurate per-target description

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Comment on lines +2 to +8
olive-ai
ep:
CPUExecutionProvider
CUDAExecutionProvider
device:
cpu
gpu
Author

Fixed in commit ea2a218. Rewrote info.yml to match the builtin recipe schema used by other recipes (e.g., Qwen2.5-VL, Qwen3.5): proper YAML list markers for keywords, eps, devices, and added the required recipes: key with name and file. Validated with yaml.safe_load.

**CPU (k_quant_mixed INT4 text + INT8 vision + FP32 embedding):**

```bash
python optimize.py --config-dir cpu_and_mobile --device cpu --dtype f32
```
Author

Fixed in commit ea2a218. Removed --dtype f32 from the CPU command example (the flag is a no-op for the Olive-managed export). Updated the Notes section to explicitly state that --dtype is accepted for backward compatibility but does not control export precision — precision is set in the JSON config files (vision_embedding_export.json).

Comment on lines +269 to +275
Note: precision for vision/embedding export is set in vision_embedding_export.json
(fp16 for cuda/webgpu, fp32 for cpu_and_mobile). The --dtype CLI arg is accepted for backward compatibility but
does not affect the Olive-managed export.
"""
if models_dir is None:
models_dir = str(Path(config_dir) / MODELS_DIR)

Author

Fixed in commit ea2a218. Added a DeprecationWarning in export_models() when a non-default --dtype is passed, pointing users to vision_embedding_export.json as the correct place to control precision. Updated the docstring to explicitly call out that the parameter does not affect the Olive-managed export. The parameter is kept for backward compatibility with any existing scripts that pass --dtype.
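A minimal sketch of this kind of deprecation path is shown below. It is not the code from optimize.py: `DEFAULT_DTYPE`, `parse_args`, and `warn_if_dtype_set` are hypothetical, and the default value `f16` is an assumption.

```python
import argparse
import warnings

DEFAULT_DTYPE = "f16"  # assumption: the real default in optimize.py may differ


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config-dir", default="cuda")
    parser.add_argument(
        "--dtype",
        default=DEFAULT_DTYPE,
        help="Deprecated; precision is set in the JSON configs.",
    )
    return parser.parse_args(argv)


def warn_if_dtype_set(dtype: str) -> None:
    """Warn when a non-default --dtype is passed, since it has no effect on the export."""
    if dtype != DEFAULT_DTYPE:
        warnings.warn(
            "--dtype does not affect the Olive-managed export; "
            "set precision in vision_embedding_export.json instead.",
            DeprecationWarning,
            stacklevel=2,
        )


args = parse_args(["--config-dir", "cpu_and_mobile", "--dtype", "f32"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_dtype_set(args.dtype)

print(len(caught))
```

Python's default filters hide `DeprecationWarning` unless it is attributed to code running in `__main__`, so a script that wants end users to see the message may prefer `FutureWarning` or an explicit log line.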

Comment on lines +406 to +413
if (i + 1) % 10 == 0 or i == 0:
print(
f" [{i + 1:4d}/{total}] gt={gt} pred={pred} raw={raw.strip()!r:20} "
f"{'✓' if hit else '✗'} running_acc={correct / (i + 1 - skipped + 1e-9):.3f}"
)

if hit:
correct += 1
Author

Fixed in commit ea2a218. Changed correct to correct + (1 if hit else 0) in the running_acc print so the displayed accuracy reflects the current sample's result before correct is incremented.
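The off-by-one can be seen in a stripped-down sketch that leaves out the `skipped` counter from the original loop. `running_accuracy` is a hypothetical helper; it folds the current sample into the count before the ratio is computed, which has the same effect as the fix described above.

```python
def running_accuracy(hits):
    """Yield accuracy after each sample, counting the current sample's result."""
    correct = 0
    for i, hit in enumerate(hits):
        correct += 1 if hit else 0  # update before reporting
        yield correct / (i + 1)


print([round(a, 3) for a in running_accuracy([True, False, True, True])])
# [1.0, 0.5, 0.667, 0.75]
```

With the original ordering, the first printed value for this sequence would have been 0.0 even though the first sample was a hit.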

@titaiwangms
Author

Review (synthesized from 4 reviewers: readability / code / critical / deep)

Convergent findings flagged with [N reviewers]. Severity follows worst-case among reviewers.

Major

1. --dtype is a documented no-op contradicting the README [2: code, readability]

optimize.py: the arg is parsed and passed around but never wired into any JSON config — text precision is hardcoded "int4" in text.json, vision/embedding precision is set per-target in vision_embedding_export.json. The README's CPU example explicitly recommends --dtype f32, which now does nothing:

python optimize.py --config-dir cpu_and_mobile --device cpu --dtype f32   # f32 is a no-op

Fix: either wire --dtype into a config["passes"][...]["precision"] override before olive.run(config), or remove it from the README/help text and mark the flag deprecated.

2. _get_hf_config() ignores --model-path [2: code, critical]

Always loads from the hardcoded MODEL_NAME HuggingFace ID, breaking the offline / local-checkpoint workflow the README advertises (--model-path /local/Ministral still tries to contact HF in update_genai_config()). Fix: try args.model_path first; fall back to MODEL_NAME only if loading fails.
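One possible shape for the suggested fallback is sketched below. The loader is injected so the example runs without network access; in the recipe it would presumably be a Hugging Face `from_pretrained` call, which is an assumption here. `load_config_with_fallback` and `fake_loader` are hypothetical names.

```python
from typing import Callable, Optional

MODEL_NAME = "mistralai/Ministral-3-3B-Instruct-2512"


def load_config_with_fallback(model_path: Optional[str], loader: Callable[[str], object]):
    """Try the user-supplied path first; fall back to the hub ID only if that fails."""
    if model_path:
        try:
            return loader(model_path)
        except (OSError, ValueError):
            pass  # fall through to the default model ID
    return loader(MODEL_NAME)


def fake_loader(source: str) -> dict:
    """Stand-in loader: paths under /missing fail, everything else succeeds."""
    if source.startswith("/missing"):
        raise OSError(f"cannot load {source}")
    return {"source": source}


print(load_config_with_fallback("/local/Ministral", fake_loader)["source"])
print(load_config_with_fallback("/missing/dir", fake_loader)["source"])
```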

3. parse_answer() takes the first matched digit [code]

eval.py: verbose model outputs frequently contain distractor digits before the final answer (e.g. "Looking at option 1, then 2, then... the answer is 3"). Established Qwen AI2D scripts in this repo use the last standalone [1-4]. The current implementation likely understates accuracy. Switch to the last-match pattern.
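A last-match parser along these lines might look like the sketch below. The regular expression and the name `parse_answer_last` are illustrative assumptions, not the pattern used by the Qwen scripts.

```python
import re
from typing import Optional

# A digit 1-4 that is not glued to another letter or digit on either side.
_STANDALONE_DIGIT = re.compile(r"(?<![0-9A-Za-z])([1-4])(?![0-9A-Za-z])")


def parse_answer_last(text: str) -> Optional[int]:
    """Return the last standalone option digit in the model output, or None."""
    matches = _STANDALONE_DIGIT.findall(text)
    return int(matches[-1]) if matches else None


print(parse_answer_last("Looking at option 1, then 2, then... the answer is 3"))  # 3
```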

4. _strip_unused_initializers() — two correctness issues [critical]

optimize.py:

  • Non-atomic external-data rewrite. os.remove(data_path) followed by onnx.save(model, onnx_path). A crash, kill, or disk-full between the two leaves model.onnx pointing to missing external data — i.e. a corrupted artifact after an otherwise-successful quantization. Fix: save to staging filenames, validate, then atomic rename.
  • Shallow reachability. The walk only covers graph.node[].input and graph.input. It misses graph.output, initializers consumed only as graph outputs, and any references inside subgraphs (control-flow ops like If/Loop/Scan). A live initializer can be silently stripped, producing a corrupt model that fails far downstream. Fix: include graph.output, recurse into subgraphs in node attributes, and validate with onnx.checker.check_model(..., full_check=True) after rewrite. Or drop this in favor of an existing Olive pass.
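The reachability fix can be illustrated on a plain-dict stand-in for an ONNX graph. This is a sketch of the idea only: the dict layout and function names are hypothetical, and a real implementation would walk `onnx` protos and finish with the `onnx.checker.check_model` validation mentioned above.

```python
from typing import Dict, Set


def used_names(graph: Dict) -> Set[str]:
    """Collect every tensor name a graph references, including nested subgraphs."""
    used = set(graph.get("inputs", [])) | set(graph.get("outputs", []))
    for node in graph.get("nodes", []):
        used.update(node.get("inputs", []))
        for sub in node.get("subgraphs", []):  # bodies of If/Loop/Scan
            used |= used_names(sub)  # subgraphs may reference outer-scope names
    return used


def strip_unused_initializers(graph: Dict) -> Dict:
    """Return a copy of the graph that keeps only referenced top-level initializers."""
    live = used_names(graph)
    kept = {name: value for name, value in graph["initializers"].items() if name in live}
    return {**graph, "initializers": kept}


graph = {
    "initializers": {"w_q": "int8", "w_fp16": "stale", "bias_out": "fp16", "w_branch": "int8"},
    "inputs": ["pixel_values"],
    "outputs": ["features", "bias_out"],  # bias_out is used only as a graph output
    "nodes": [
        {"inputs": ["pixel_values", "w_q"], "subgraphs": []},
        {"inputs": ["cond"], "subgraphs": [{"nodes": [{"inputs": ["w_branch"]}], "outputs": ["y"]}]},
    ],
}
print(sorted(strip_unused_initializers(graph)["initializers"]))
# ['bias_out', 'w_branch', 'w_q']
```

The walk deliberately over-approximates: a name that merely shadows an initializer inside a subgraph keeps that initializer alive, which costs size but never removes a live tensor.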

5. vision_encoder → vision rename is load-bearing across 7 files [readability]

The same artifact is referenced by three different names: "vision_encoder" (in 4 JSONs as components_to_export and as model_path in vision.json), intermediate dir vision_encoder/, and final dir vision/ (what ORT GenAI expects). The pivot is one line buried in quantize_vision_and_embedding(). If mobius ever renames the component, 7 files break independently with no shared definition. Suggest constants at the top of optimize.py:

_MOBIUS_VISION_COMPONENT = "vision_encoder"
_GENAI_VISION_COMPONENT  = "vision"

…and a comment on the pivot line cross-referencing them.

6. Recipe depends on two unmerged Olive PRs without a version pin [critical]

requirements.txt installs git+https://github.com/microsoft/Olive.git@main. If this recipe lands before microsoft/Olive#2456 and #2457 are merged, users will hit "unknown PassConfigParam: components_to_export" / "components_to_skip" before export even starts. Either:

  • merge after both Olive PRs land, or
  • pin Olive to a commit SHA that contains both, or
  • add an explicit feature check at the top of optimize.py with a clear error.

7. eval.py enables trust_remote_code=True for arbitrary --pytorch_model [critical]

Mistral3ForConditionalGeneration.from_pretrained(model_id, ..., trust_remote_code=True)
AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Crosses a network trust boundary — passing a malicious HF repo executes arbitrary code during evaluation. Fix: default trust_remote_code=False; gate behind an explicit --trust-remote-code flag with a warning, or allowlist known Mistral IDs.
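An opt-in flag of the kind suggested could be wired up roughly as follows. The flag name follows the suggestion above; the default model ID, `build_parser`, and `resolve_trust` are assumptions made for this sketch, and the returned value would be passed as `trust_remote_code=` to the `from_pretrained` calls.

```python
import argparse
import warnings


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumption: the default model ID; eval.py may define it differently.
    parser.add_argument("--pytorch_model", default="mistralai/Ministral-3-3B-Instruct-2512")
    parser.add_argument(
        "--trust-remote-code",
        action="store_true",
        help="Allow the model repo to execute its own Python code.",
    )
    return parser


def resolve_trust(args: argparse.Namespace) -> bool:
    """Return the trust_remote_code value, warning loudly when it is enabled."""
    if args.trust_remote_code:
        warnings.warn(
            f"Executing remote code from {args.pytorch_model!r}",
            UserWarning,
            stacklevel=2,
        )
    return args.trust_remote_code


args = build_parser().parse_args([])
print(resolve_trust(args))  # False
```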

Minor

  • quantize_vision_and_embedding() loops over ("vision", "embedding"), but no embedding.json exists in any target dir — embedding is always silently skipped. The docstring explains it but the function name still misleads. Rename to quantize_vision(), or make the loop data-driven (glob existing JSONs).
  • eval.py loads lmms-lab/ai2d without a pinned revision — published benchmark numbers become non-reproducible if the dataset changes. Pin a revision and document the license.
  • inference.py --max_length is unbounded — consider capping or warning above some threshold to avoid runaway generation.

Praise

  • The recipe codifies a non-trivial three-stage pipeline cleanly and follows the established Qwen-VL recipe structure, making it easy to compare side-by-side.
  • runtime="none" on the MobiusBuilder pass to avoid colliding with ModelBuilder's genai_config.json is a careful read of the framework — verified correct (Mobius writes genai config only when runtime == ORT_GENAI).
  • Calling olive.run(config) directly rather than shelling out avoids shell-injection risk in optimize.py — good choice.

Recommended block-list before merge

  1. Pin Olive (Major item 6): without this the recipe simply can't run.
  2. parse_answer last-match (Major item 3): eval correctness; affects published numbers.
  3. _strip_unused_initializers atomicity + reachability (Major item 4): risks corrupting otherwise-good output.
  4. trust_remote_code default (Major item 7): security posture.

Posted by GitHub Copilot CLI on behalf of @titaiwangms.

GitHub Copilot and others added 6 commits May 11, 2026 21:11
- info.yml: fix invalid YAML (list markers, recipes key) to pass catalog validation
- eval.py: fix running_acc off-by-one (include current hit before incrementing correct)
- README: remove misleading --dtype f32 from CPU command; clarify --dtype is deprecated
- optimize.py: add DeprecationWarning when non-default --dtype is passed; update docstring

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ull dir structure

- Fix CUDA model size: 3.83G → 3.6G (benchmark table and prose)
- Replace all direct 'Mobius'/'mobius' export references with 'Olive/MobiusBuilder pass'
- Update directory structure to show vision_embedding_export.json in all 3 targets
  (cuda, webgpu, cpu_and_mobile) and add webgpu target
- Update output structure: vision shows INT8 (post-quant), embedding shows FP16/FP32
- Update CPU/CUDA pipeline notes to CUDA/WebGPU vs CPU framing
- Update export_vision_and_embedding() code section description to reflect flat→subdir reorganization
- Top-level README is auto-generated from info.yml; no manual update needed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>