[Ministral-3-3B] Add VLM export, INT4 quantization, and evaluation pipeline #352
Open
titaiwangms wants to merge 36 commits into main from ministral-3b-text-export
Changes from 30 commits
Commits
- 1bdc231 Add Ministral-3-3B VLM recipe: hybrid Olive + Mobius export (titaiwangms)
- c29131f feat: finalize INT4 quantization pipeline with eval benchmarks (titaiwangms)
- 3acaa7f Address 8 review comments: docs, code quality, consistency (titaiwangms)
- 25d460e Fix --models-dir path: copy exports to custom dir before config gener… (titaiwangms)
- d164890 Disable CUDA graph capture for VLM (matches Qwen convention) (titaiwangms)
- 860675c Fix: clear Olive cache before quantization to prevent stale output paths (titaiwangms)
- 25009c2 Refactor: write exports directly to --models-dir, eliminate copy (titaiwangms)
- b8d8570 Add CUDA vision.json for INT4 quantization of vision model (titaiwangms)
- 860a27f Increase eval.py max_length from 2000 to 8192 to evaluate all samples (titaiwangms)
- 11f7801 Add TTFT and model size reporting to eval.py (titaiwangms)
- a6c7888 Upgrade quantization to k_quant_mixed, add FP8 dequant to eval, add e… (titaiwangms)
- b321a20 Upgrade vision to INT8, clean up optimize.py, remove embedding configs (titaiwangms)
- b90a96c Fix stale INT4 references and update docs for INT8 vision (titaiwangms)
- d90b513 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- ec68fac Add PixtralImageSizes to processor pipeline for multi-image support (titaiwangms)
- f8425b3 Clarify latency measurement methodology in README (titaiwangms)
- 292b1c2 Fix reviewer feedback on README latency docs (titaiwangms)
- c10ad83 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- 469d6b2 Add WebGPU recipe for Ministral-3-3B (titaiwangms)
- 342a907 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- 1eeceb8 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- b944184 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- a548984 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- 1fafb45 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- c9e15ac feat: replace direct mobius.build() with Olive MobiusBuilder pass for…
- 6c48739 fix: advance mobius pin, update docstring and dtype help text
- da17dfb fix: clean up intermediate vision_encoder/ after quantization succeeds
- 93fbee4 fix: correct Olive MobiusBuilder output layout for quantization pipeline
- 9079c09 feat: apply MobiusBuilder export pattern to webgpu and cpu_and_mobile…
- 3b7d85d docs: fix FP16-only precision claims in optimize.py
- ea2a218 fix: address PR #352 review issues
- 9a0c763 Merge branch 'main' into ministral-3b-text-export (titaiwangms)
- fb569bc docs: update README with MobiusBuilder approach, 3.6G size, webgpu, f…
- c24ccac Fix model accuracy (hanbitmyths)
- b656d03 Merge branch 'main' into ministral-3b-text-export (hanbitmyths)
- 7c76d48 Fix pre-commit failure (hanbitmyths)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| # Generated model artifacts | ||
| models/ | ||
|
|
||
| # Python bytecode | ||
| __pycache__/ | ||
| *.pyc | ||
|
|
||
| # Olive cache | ||
| .olive-cache/ |
219 changes: 219 additions & 0 deletions
mistralai-Ministral-3-3B-Instruct-2512/builtin/README.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,219 @@ | ||
| # Ministral-3-3B ONNX Runtime GenAI Example | ||
|
|
||
| This example demonstrates how to convert the [Ministral-3-3B-Instruct-2512](https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512) vision-language model to ONNX format using Olive and run inference with ONNX Runtime GenAI. | ||
|
|
||
| Ministral-3-3B is a multimodal (VLM) model combining a Pixtral vision encoder with a Mistral text decoder using YaRN RoPE for extended context. The pipeline exports three sub-models: | ||
| - **Vision encoder** and **embedding** via [mobius](https://github.com/onnxruntime/mobius) (declarative ONNX graph construction); vision INT8-quantized via Olive | ||
| - **Text decoder** via Olive/ModelBuilder (GQA + k_quant_mixed INT4 quantization) | ||
|
|
||
| ## Exported Configurations | ||
|
|
||
| | Component | CUDA | CPU | | ||
| |-----------|------|-----| | ||
| | Text decoder | k_quant_mixed INT4 (`MatMulNBits`) | k_quant_mixed INT4 (`MatMulNBits`) | | ||
| | Vision encoder | INT8 RTN (`MatMul8Bits`) | INT8 RTN (`MatMul8Bits`) | | ||
| | Embedding | FP16 | FP32 | | ||
|
|
||
| - **CUDA**: k_quant_mixed INT4 text decoder + INT8 vision + FP16 embedding. Optimized for throughput on NVIDIA GPUs. | ||
| - **CPU**: k_quant_mixed INT4 text decoder + INT8 vision + FP32 embedding. Uses FP32 for embedding (CPU EP promotes FP16 to FP32). | ||
|
|
||
| ## Benchmark Results | ||
|
|
||
| Evaluated on [AI2D](https://allenai.org/data/diagrams) (science diagram multiple-choice QA, 4 options per question). | ||
|
|
||
| | Configuration | Accuracy | Samples | Model Size | Latency (s/sample) | | ||
| |---------------|----------|---------|------------|---------------------| | ||
| | PyTorch FP16 (CUDA) | 76.40% | 500 | ~7G | 0.14 | | ||
| | PyTorch FP32 (CPU) | 78.00% | 100 | ~7G | 22.84 | | ||
| | ONNX CUDA (k_quant_mixed + INT8 vision) | 74.00% | 500 | 3.83G | 0.14 | | ||
| | ONNX CPU (k_quant_mixed + INT8 vision) | 77.00% | 100 | 4.92G | 5.85 | | ||
|
|
||
| ONNX CUDA is within 2.4pp of PyTorch FP16 at **half the model size** (3.83G vs ~7G). | ||
| ONNX CPU achieves near-parity with PyTorch CPU (77.00% vs 78.00%) at **about 30% smaller** (4.92G vs ~7G) and **3.9× faster** (5.85 vs 22.84 s/sample). | ||
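The headline comparisons can be recomputed from the table. The following is a minimal sketch that uses only the figures reported there; the dictionary layout and the `compare` helper are illustrative and are not part of `eval.py`. The PyTorch size is given only as `~7G`, so the size ratios are rough.

```python
# Figures copied from the Benchmark Results table; key names are illustrative.
results = {
    "pytorch_cuda": {"acc": 76.40, "size_gb": 7.0, "latency_s": 0.14},
    "pytorch_cpu": {"acc": 78.00, "size_gb": 7.0, "latency_s": 22.84},
    "onnx_cuda": {"acc": 74.00, "size_gb": 3.83, "latency_s": 0.14},
    "onnx_cpu": {"acc": 77.00, "size_gb": 4.92, "latency_s": 5.85},
}


def compare(onnx_key, baseline_key):
    """Summarize one ONNX configuration against its PyTorch baseline."""
    onnx, base = results[onnx_key], results[baseline_key]
    return {
        # Accuracy lost relative to the baseline, in percentage points.
        "accuracy_gap_pp": round(base["acc"] - onnx["acc"], 2),
        # ONNX size as a fraction of the baseline size (~7G is approximate).
        "size_ratio": round(onnx["size_gb"] / base["size_gb"], 2),
        # How many times faster the ONNX run is per sample.
        "speedup": round(base["latency_s"] / onnx["latency_s"], 2),
    }


cuda = compare("onnx_cuda", "pytorch_cuda")
cpu = compare("onnx_cpu", "pytorch_cpu")
print("CUDA:", cuda)
print("CPU: ", cpu)
```

The CPU rows were evaluated on 100 samples and the CUDA rows on 500, so the two accuracy gaps do not carry the same statistical weight.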
|
|
||
| > **Latency Measurement:** Per-sample end-to-end inference time (image in → text out). Includes image preprocessing, tokenization, vision encoding, text generation, and decoding. Answers are short (typically 1-2 tokens for multiple-choice). Excludes model loading (one-time cost). Measured with `time.perf_counter()` averaged over all samples. No warmup run. | ||
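The measurement described in this note can be sketched as follows. This is an assumption-labeled illustration, not the code in `eval.py`: `mean_latency` and `fake_inference` are hypothetical names, and the stand-in workload replaces the real preprocessing, generation, and decoding. Model loading is assumed to happen before the function is called, so it is excluded from the timing.

```python
import time


def mean_latency(run_sample, samples):
    """Return average wall-clock seconds per sample, with no warmup run."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one sample")
    total = 0.0
    for sample in samples:
        start = time.perf_counter()
        run_sample(sample)  # everything from image input to decoded text
        total += time.perf_counter() - start
    return total / len(samples)


# Stand-in workload for illustration; a real run would invoke the model here.
def fake_inference(sample):
    time.sleep(0.01)


average = mean_latency(fake_inference, range(3))
print(f"{average:.3f} s/sample")
```

Because there is no warmup, any first-call overhead (for example lazy initialization) is averaged into the reported figure.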
|
|
||
| ## Prerequisites | ||
|
|
||
| ```bash | ||
| pip install -r requirements.txt | ||
| ``` | ||
|
|
||
| Install ONNX Runtime GenAI: | ||
|
|
||
| | Device | Install Command | | ||
| |--------|-----------------| | ||
| | CPU | `pip install onnxruntime-genai --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple` | | ||
| | GPU (CUDA) | `pip install onnxruntime-genai-cuda --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple` | | ||
|
|
||
| ## Steps | ||
|
|
||
| ### 1. Export & Optimize Models | ||
|
|
||
| **CPU (k_quant_mixed INT4 text + INT8 vision + FP32 embedding):** | ||
|
|
||
| ```bash | ||
| python optimize.py --config-dir cpu_and_mobile --device cpu --dtype f32 | ||
| ``` | ||
|
|
||
| **CUDA (k_quant_mixed INT4 text + INT8 vision + FP16 embedding):** | ||
|
|
||
| ```bash | ||
| python optimize.py --config-dir cuda --device gpu | ||
| ``` | ||
|
|
||
| **With local dequantized checkpoint (skips FP8 dequant):** | ||
|
|
||
| ```bash | ||
| python optimize.py --config-dir cpu_and_mobile --device cpu --model-path /path/to/Ministral-3-3B-dequantized | ||
| ``` | ||
|
|
||
| This runs: | ||
| - **Olive/ModelBuilder** for text decoder (GQA attention, YaRN RoPE, k_quant_mixed INT4) | ||
| - **Mobius** for vision encoder (Pixtral, dynamic H×W, 2D RoPE) and embedding (token + image fusion) | ||
| - **Olive INT8 quantization** on vision encoder (both CUDA and CPU) | ||
|
|
||
| Then generates `genai_config.json` and `processor_config.json` for the ORT GenAI runtime. | ||
|
|
||
| ### 2. Output Structure | ||
|
|
||
| ``` | ||
| cpu_and_mobile/models/ # or cuda/models/ | ||
| ├── decoder/ | ||
| │ ├── model.onnx # Text decoder (Mistral + YaRN) | ||
| │ └── model.onnx.data | ||
| ├── vision/ | ||
| │ ├── model.onnx # Pixtral vision encoder (INT8) | ||
| │ └── model.onnx.data | ||
| ├── embedding/ | ||
| │ ├── model.onnx # Embedding fusion model (FP32 for CPU, FP16 for CUDA) | ||
| │ └── model.onnx.data | ||
| ├── genai_config.json # Runtime configuration | ||
| ├── processor_config.json # Pixtral image preprocessing | ||
| ├── tokenizer.json | ||
| └── tokenizer_config.json | ||
| ``` | ||
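A quick way to confirm that an export produced this layout is to check for each file in the tree. The sketch below is a hypothetical helper (`missing_files` is not part of the recipe) and assumes the file names are exactly those listed above.

```python
from pathlib import Path

# Relative paths taken from the output tree above.
EXPECTED_FILES = [
    "decoder/model.onnx",
    "decoder/model.onnx.data",
    "vision/model.onnx",
    "vision/model.onnx.data",
    "embedding/model.onnx",
    "embedding/model.onnx.data",
    "genai_config.json",
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_files(models_dir):
    """Return the expected relative paths that are absent under models_dir."""
    root = Path(models_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

For example, `missing_files("cpu_and_mobile/models")` returns an empty list when the export is complete.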
|
|
||
| ### 3. Run Inference | ||
|
|
||
| ```bash | ||
| # Text-only | ||
| python inference.py --prompt "What is the capital of France?" | ||
|
|
||
| # Image + text | ||
| python inference.py --image photo.jpg --prompt "Describe this image" | ||
|
|
||
| # Interactive mode | ||
| python inference.py --interactive | ||
|
|
||
| # CUDA model | ||
| python inference.py --model_path cuda/models --prompt "Hello" | ||
| ``` | ||
|
|
||
| Alternatively, use the built-in GenAI multimodal demo: | ||
|
|
||
| ```bash | ||
| python -m onnxruntime_genai.models.model_mm -m cpu_and_mobile/models --max_length 4096 | ||
| ``` | ||
|
|
||
| ### 4. Evaluate | ||
|
|
||
| Run the AI2D science diagram QA benchmark (see [Benchmark Results](#benchmark-results) for expected accuracy): | ||
|
|
||
| ```bash | ||
| # ONNX only (CPU) | ||
| python eval.py --device cpu --model_path cpu_and_mobile/models | ||
|
|
||
| # ONNX only (CUDA) | ||
| python eval.py --device cuda --model_path cuda/models | ||
|
|
||
| # PyTorch baseline (BF16 variant avoids FP8 kernel requirement) | ||
| python eval.py --skip_onnx --pytorch_model mistralai/Ministral-3-3B-Instruct-2512-BF16 --device cpu --num_samples 100 | ||
|
|
||
| # Compare ONNX vs PyTorch side-by-side | ||
| python eval.py --model_path cuda/models --pytorch_model mistralai/Ministral-3-3B-Instruct-2512-BF16 --num_samples 100 | ||
| ``` | ||
|
|
||
| > **Note:** The default HuggingFace checkpoint (`Ministral-3-3B-Instruct-2512`) uses FP8 weights, | ||
| > which require a specific CUDA kernel build. Use the `-BF16` variant for PyTorch baselines. | ||
|
|
||
| ## Directory Structure | ||
|
|
||
| ``` | ||
| mistralai-Ministral-3-3B-Instruct-2512/builtin/ | ||
| ├── cpu_and_mobile/ | ||
| │ ├── text.json # k_quant_mixed INT4 text decoder config (Olive/ModelBuilder) | ||
| │ └── vision.json # INT8 vision quantization (Olive, post-mobius) | ||
| ├── cuda/ | ||
| │ ├── text.json # k_quant_mixed INT4 text decoder config (Olive/ModelBuilder) | ||
| │ └── vision.json # INT8 vision quantization (Olive, post-mobius) | ||
| ├── optimize.py # Export orchestrator (Olive + Mobius) | ||
| ├── inference.py # ORT GenAI inference (text + VLM) | ||
| ├── eval.py # AI2D benchmark evaluation | ||
| ├── requirements.txt | ||
| ├── info.yml | ||
| └── README.md | ||
| ``` | ||
|
|
||
| > **Note:** Unlike Qwen VLM recipes (which use Olive for all 3 sub-models end-to-end), | ||
| > Ministral uses **mobius** for vision and embedding ONNX export, then **Olive** for | ||
| > INT8 quantization of vision. Embedding stays FP16 (or FP32 for CPU). | ||
|
|
||
| ## Differences from Qwen VLM Recipes | ||
|
|
||
| Qwen VLM recipes export all three sub-models through Olive using JSON configs | ||
| (`text.json`, `vision.json`). Each JSON defines a multi-pass | ||
| pipeline: PyTorch export → graph surgery → ORT fusion → quantization/FP16. | ||
|
|
||
| This recipe takes a different approach for **vision and embedding**: | ||
|
|
||
| | Component | Qwen | Ministral | Why | | ||
| |-----------|------|-----------|-----| | ||
| | Text decoder | Olive/ModelBuilder (`text.json`) | Olive/ModelBuilder (`text.json`) | Same — ModelBuilder handles GQA + quantization | | ||
| | Vision encoder | Olive: PyTorch export + 5-6 passes | **Mobius** export + Olive INT8 (`vision.json`) | Pixtral's dynamic image dims break `torch.onnx.export` | | ||
| | Embedding | Olive: PyTorch export + 5 passes | **Mobius** export (FP16/FP32, no quantization) | Olive's GatherBlockQuantized has data format bugs | | ||
|
|
||
| **Why does Ministral use mobius instead of Olive for export?** Mobius constructs | ||
| the ONNX graph declaratively rather than tracing through PyTorch. The resulting | ||
| models already contain the graph optimizations that Qwen's Olive passes spend | ||
| 5-6 steps creating: | ||
|
|
||
| - **Fused operators:** `MultiHeadAttention`, `SkipSimplifiedLayerNormalization`, | ||
| `RotaryEmbedding` — already present in mobius output (Qwen achieves these via | ||
| `OrtTransformersOptimization`) | ||
| - **FP16 weights:** all 840M vision params exported as FP16 directly (Qwen | ||
| converts from FP32 via `OnnxFloatToFloat16`) | ||
| - **Clean graph:** 0 Gemm nodes, 0 redundant Cast chains (Qwen cleans these | ||
| via `GemmToMatMulAdd` and `OnnxPeepholeOptimizer`) | ||
| - **No PyTorch export artifacts:** no `PackedAttentionToLoopMHA` surgery needed | ||
| since mobius doesn't go through dynamo | ||
|
|
||
| **What Olive still handles:** `vision.json` applies | ||
| `OnnxBlockWiseRtnQuantization` (INT8) to the mobius-exported FP16 vision model | ||
| for both CUDA and CPU targets. | ||
|
|
||
| **Why optimize.py has more lines (~400) than Qwen (~170):** | ||
|
|
||
| | Code section | Lines | Why it can't be JSON-driven | | ||
| |---|---|---| | ||
| | `export_vision_and_embedding()` | ~55 | Olive has no mobius integration; Pixtral's dynamic dims cause dynamo failures | | ||
| | `update_genai_config()` | ~150 | Olive generates decoder config only; VLM 3-model config + transforms-based processor_config has no Olive pass | | ||
| | `quantize_vision_and_embedding()` | ~25 | Post-export INT8 on pre-built ONNX (Olive JSON-driven, but needs orchestration) | | ||
| | `fix_tokenizer()` | ~15 | No Olive tokenizer patching pass | | ||
|
|
||
| The text decoder export (`text.json`) and INT8 quantization (`vision.json`) ARE Olive JSON-driven — identical to Qwen. | ||
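As an illustration of the kind of glue code listed in the table, a tokenizer fix could look like the sketch below. It is not the recipe's `fix_tokenizer()`: the function name `patch_tokenizer_class` is hypothetical, and the use of the `tokenizer_class` key in `tokenizer_config.json` is an assumption based on the usual Hugging Face layout. The class names come from the Notes section of this README.

```python
import json
from pathlib import Path


def patch_tokenizer_class(models_dir, old="TokenizersBackend", new="LlamaTokenizer"):
    """Rewrite tokenizer_class in tokenizer_config.json; return True if changed."""
    # Assumption: the class name lives under the "tokenizer_class" key.
    path = Path(models_dir) / "tokenizer_config.json"
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("tokenizer_class") != old:
        return False  # nothing to patch, leave the file untouched
    config["tokenizer_class"] = new
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return True
```

The function is idempotent: a second call finds nothing to change and returns `False`.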
|
|
||
| ## Known Limitations | ||
|
|
||
| - **CPU vision: language drift on some images.** The quantized vision encoder occasionally produces embeddings that cause the text decoder to respond in the wrong language (e.g., Chinese instead of English). This has been observed on specific test images and is a known artifact of vision quantization. INT8 significantly reduces this compared to INT4. | ||
| - **FP8 checkpoint requires special kernels.** The default HuggingFace checkpoint uses FP8 weights. Use the `-BF16` variant for PyTorch evaluation on machines without FP8 kernel support. | ||
| ## Notes | ||
|
|
||
| - **Multi-image supported.** The runtime supports variable-count multi-image inputs via PixtralImageSizes metadata. This requires an onnxruntime-extensions build that includes PR #1050 and models exported with PixtralImageSizes in `processor_config.json`. | ||
|
|
||
| - **CPU pipeline**: Mobius exports FP16 as an intermediate format. Olive then quantizes vision to INT8. For CPU deployment, use `--dtype f32` so embedding outputs float32 natively (CPU EP promotes FP16 to FP32, which causes genai dtype mismatches). | ||
| - **CUDA pipeline**: Mobius exports FP16 directly for vision/embedding. Olive quantizes vision to INT8. Text decoder uses k_quant_mixed INT4 via ModelBuilder. | ||
| - The HuggingFace checkpoint uses FP8 quantized weights. The export pipeline dequantizes these automatically (`weight * weight_scale_inv`). | ||
| - The tokenizer uses the `TokenizersBackend` class, which genai doesn't support. The optimize script changes it to `LlamaTokenizer`. | ||
| - Pixtral vision supports dynamic image sizes (multiples of 28, up to 1540×1540). | ||
| - The text decoder includes `llama_4_attn_scale` for long-context attention (>16K tokens). | ||
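The FP8 dequantization formula mentioned in the notes (`weight * weight_scale_inv`) can be illustrated with plain Python. This is a sketch under stated assumptions: it treats `weight_scale_inv` as a single per-tensor scalar and represents the weight as a nested list of already-decoded values, whereas the real pipeline works on checkpoint tensors and may use a different scale granularity.

```python
def dequantize(weight, weight_scale_inv):
    """Return weight * weight_scale_inv for a 2-D list and a scalar scale."""
    # Assumption: one scalar scale for the whole tensor.
    return [[value * weight_scale_inv for value in row] for row in weight]


fp8_values = [[1.0, -2.0], [0.5, 4.0]]  # values as decoded from FP8 storage
restored = dequantize(fp8_values, 0.25)
print(restored)
```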
19 changes: 19 additions & 0 deletions
mistralai-Ministral-3-3B-Instruct-2512/builtin/cpu_and_mobile/text.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "mistralai/Ministral-3-3B-Instruct-2512" | ||
| }, | ||
| "passes": { | ||
| "convert": { | ||
| "type": "ModelBuilder", | ||
| "precision": "int4", | ||
| "int4_accuracy_level": 4, | ||
| "int4_algo_config": "k_quant_mixed", | ||
| "extra_options": { | ||
| "filename": "model.onnx" | ||
| } | ||
| } | ||
| }, | ||
| "no_artifacts": true, | ||
| "output_dir": "cpu_and_mobile/models/decoder" | ||
| } |
19 changes: 19 additions & 0 deletions
mistralai-Ministral-3-3B-Instruct-2512/builtin/cpu_and_mobile/vision.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "ONNXModel", | ||
| "model_path": "cpu_and_mobile/models/vision_encoder/model.onnx" | ||
| }, | ||
| "passes": { | ||
| "int8": { | ||
| "type": "OnnxBlockWiseRtnQuantization", | ||
| "bits": 8, | ||
| "block_size": 128, | ||
| "is_symmetric": true, | ||
| "accuracy_level": 4, | ||
| "save_as_external_data": true, | ||
| "external_data_name": "model.onnx.data" | ||
| } | ||
| }, | ||
| "no_artifacts": true, | ||
| "output_dir": "cpu_and_mobile/models/vision" | ||
| } |
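To make the `bits`, `block_size`, and `is_symmetric` settings concrete, here is a generic sketch of symmetric block-wise round-to-nearest (RTN) quantization. It is a conceptual illustration only and does not reproduce how Olive or ONNX Runtime pack, store, or execute the quantized weights; all function names are hypothetical.

```python
def quantize_block(block, bits=8):
    """Symmetric round-to-nearest quantization of one block of floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in block]
    return quantized, scale


def quantize_blockwise(values, block_size=128, bits=8):
    """Split a flat weight list into blocks and quantize each independently."""
    blocks = []
    for start in range(0, len(values), block_size):
        blocks.append(quantize_block(values[start:start + block_size], bits))
    return blocks


def dequantize_blockwise(blocks):
    """Reconstruct approximate floats from (quantized, scale) pairs."""
    return [q * scale for quantized, scale in blocks for q in quantized]


# Synthetic weights in roughly [-2, 2]; 300 values give blocks of 128, 128, 44.
weights = [((i * 37) % 101 - 50) / 25.0 for i in range(300)]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(blocks)} blocks, max abs error {max_error:.5f}")
```

Each block carries its own scale, so the rounding error in a block is bounded by half of that block's scale. This is why smaller blocks generally track the original weights more closely, at the cost of storing more scales.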
16 changes: 16 additions & 0 deletions
mistralai-Ministral-3-3B-Instruct-2512/builtin/cpu_and_mobile/vision_embedding_export.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "mistralai/Ministral-3-3B-Instruct-2512" | ||
| }, | ||
| "passes": { | ||
| "export": { | ||
| "type": "MobiusBuilder", | ||
| "precision": "fp32", | ||
| "runtime": "none", | ||
| "components_to_export": ["vision_encoder", "embedding"] | ||
| } | ||
| }, | ||
| "no_artifacts": true, | ||
| "output_dir": "cpu_and_mobile/models" | ||
| } |
32 changes: 32 additions & 0 deletions
mistralai-Ministral-3-3B-Instruct-2512/builtin/cuda/text.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "mistralai/Ministral-3-3B-Instruct-2512" | ||
| }, | ||
| "passes": { | ||
| "convert": { | ||
| "type": "ModelBuilder", | ||
| "precision": "int4", | ||
| "int4_accuracy_level": 4, | ||
| "int4_algo_config": "k_quant_mixed", | ||
| "extra_options": { | ||
| "filename": "model.onnx" | ||
| } | ||
| } | ||
| }, | ||
| "engine": { | ||
| "target": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "gpu", | ||
| "execution_providers": [ | ||
| "CUDAExecutionProvider" | ||
| ] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "no_artifacts": true, | ||
| "output_dir": "cuda/models/decoder" | ||
| } |
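Compared with `cpu_and_mobile/text.json`, this file differs only in the added `engine` block and the `output_dir`. The sketch below derives one from the other to make that difference explicit. It is purely illustrative: the PR keeps two hand-written JSON files, and `to_cuda_config` is a hypothetical helper.

```python
import copy
import json

# Contents of cpu_and_mobile/text.json as shown in this PR.
CPU_TEXT_CONFIG = {
    "input_model": {
        "type": "HfModel",
        "model_path": "mistralai/Ministral-3-3B-Instruct-2512",
    },
    "passes": {
        "convert": {
            "type": "ModelBuilder",
            "precision": "int4",
            "int4_accuracy_level": 4,
            "int4_algo_config": "k_quant_mixed",
            "extra_options": {"filename": "model.onnx"},
        }
    },
    "no_artifacts": True,
    "output_dir": "cpu_and_mobile/models/decoder",
}

# The engine block that the CUDA configs add.
CUDA_ENGINE = {
    "target": {
        "type": "LocalSystem",
        "accelerators": [
            {"device": "gpu", "execution_providers": ["CUDAExecutionProvider"]}
        ],
    }
}


def to_cuda_config(cpu_config, output_dir="cuda/models/decoder"):
    """Return a copy of cpu_config with the CUDA target and output directory."""
    config = copy.deepcopy(cpu_config)
    config["engine"] = copy.deepcopy(CUDA_ENGINE)
    config["output_dir"] = output_dir
    return config


cuda_config = to_cuda_config(CPU_TEXT_CONFIG)
print(json.dumps(cuda_config, indent=2))
```

The quantization pass itself is identical for both targets; only the execution provider and the destination directory change.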
32 changes: 32 additions & 0 deletions
mistralai-Ministral-3-3B-Instruct-2512/builtin/cuda/vision.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "ONNXModel", | ||
| "model_path": "cuda/models/vision_encoder/model.onnx" | ||
| }, | ||
| "passes": { | ||
| "int8": { | ||
| "type": "OnnxBlockWiseRtnQuantization", | ||
| "bits": 8, | ||
| "block_size": 128, | ||
| "is_symmetric": true, | ||
| "accuracy_level": 4, | ||
| "save_as_external_data": true, | ||
| "external_data_name": "model.onnx.data" | ||
| } | ||
| }, | ||
| "engine": { | ||
| "target": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "gpu", | ||
| "execution_providers": [ | ||
| "CUDAExecutionProvider" | ||
| ] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "no_artifacts": true, | ||
| "output_dir": "cuda/models/vision" | ||
| } |
29 changes: 29 additions & 0 deletions
mistralai-Ministral-3-3B-Instruct-2512/builtin/cuda/vision_embedding_export.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| { | ||
| "input_model": { | ||
| "type": "HfModel", | ||
| "model_path": "mistralai/Ministral-3-3B-Instruct-2512" | ||
| }, | ||
| "passes": { | ||
| "export": { | ||
| "type": "MobiusBuilder", | ||
| "precision": "fp16", | ||
| "runtime": "none", | ||
| "components_to_export": ["vision_encoder", "embedding"] | ||
| } | ||
| }, | ||
| "engine": { | ||
| "target": { | ||
| "type": "LocalSystem", | ||
| "accelerators": [ | ||
| { | ||
| "device": "gpu", | ||
| "execution_providers": [ | ||
| "CUDAExecutionProvider" | ||
| ] | ||
| } | ||
| ] | ||
| } | ||
| }, | ||
| "no_artifacts": true, | ||
| "output_dir": "cuda/models" | ||
| } |
Fixed in commit ea2a218. Removed `--dtype f32` from the CPU command example (the flag is a no-op for the Olive-managed export). Updated the Notes section to explicitly state that `--dtype` is accepted for backward compatibility but does not control export precision; precision is set in the JSON config files (`vision_embedding_export.json`).