fix(puzzletron): use prebuilt KD dataset to avoid 136GB download by TheSabari07 · Pull Request #1726 · NVIDIA/Model-Optimizer

TheSabari07 · 2026-06-15T10:19:56Z

What does this PR do?

Type of change: Bug fix, documentation

This PR updates the Puzzletron dataset preparation flow to use the already published
prebuilt dataset nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2 by default,
avoiding the need to download the full raw
nvidia/Nemotron-Post-Training-Dataset-v2 dataset (~136 GB) just to filter it
down to the same ~2.6 GB result.

Changes included:

- Add PREBUILT_KD_DATASET constant in prepare_dataset.py
- Short-circuit dataset preparation when dataset_name matches the prebuilt dataset,
  loading it directly and skipping the download + filtering pipeline
- Update 8 Puzzletron example configs to use the prebuilt dataset path by default
- Update the Puzzletron README to document the default ~3 GB path and clarify that
  the raw ~136 GB path is still available if users want to reproduce preprocessing
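
The short-circuit described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names PREBUILT_KD_DATASET and process_and_save_dataset come from the PR text, but the signature, the injected `load_fn` and `filter_and_split_fn` callables, and the returned labels are assumptions made so the sketch runs without downloading anything.

```python
from typing import Any, Callable

# Constant name taken from the PR description; the value is the published dataset ID.
PREBUILT_KD_DATASET = "nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2"


def process_and_save_dataset(
    dataset_name: str,
    output_dir: str,
    load_fn: Callable[[str], Any],             # hypothetical: stands in for the real loader
    filter_and_split_fn: Callable[[Any], Any],  # hypothetical: stands in for filtering + split
) -> str:
    """Prepare a dataset, save it to output_dir, and report which path ran."""
    dataset = load_fn(dataset_name)

    if dataset_name == PREBUILT_KD_DATASET:
        # The prebuilt dataset is already filtered and split: save as-is, return early.
        dataset.save_to_disk(output_dir)
        return "prebuilt"

    # Raw path: filter and split before saving.
    processed = filter_and_split_fn(dataset)
    processed.save_to_disk(output_dir)
    return "raw"
```

In a real run, `load_fn` would presumably be something like `datasets.load_dataset`, whose return value also exposes `save_to_disk`. Injecting the callables keeps the branch easy to exercise with mocks, in the spirit of the mock-based check mentioned under Testing.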

Usage

Default lightweight path:

python -m modelopt.torch.puzzletron.dataset.prepare_dataset \
  --dataset_name nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2 \
  --output_dir path/to/Puzzle-KD-Nemotron-Post-Training-Dataset-v2

Raw dataset path (existing behavior, still supported):

python -m modelopt.torch.puzzletron.dataset.prepare_dataset \
  --dataset_name nvidia/Nemotron-Post-Training-Dataset-v2 \
  --output_dir path/to/Nemotron-Post-Training-Dataset-v2

Testing

Ran pre-commit run --all-files
Most hooks passed successfully
Local pre-commit mypy reported unrelated existing errors in:
- modelopt/torch/opt/config_loader.py
- modelopt/recipe/loader.py
Verified this change separately with a local mock-based test:
- prebuilt dataset path correctly loads and saves directly
- original raw dataset path remains untouched

Before your PR is "Ready for review"

Is this change backward compatible?: ✅
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
Did you write any new necessary tests?: N/A
Did you update Changelog?: N/A
Did you get Claude approval on this PR?: N/A

Additional Information

This change preserves the original raw-dataset workflow for users who explicitly want
to regenerate the filtered dataset from scratch, while making the default example flow
much lighter and easier to use.

Summary by CodeRabbit

Release Notes

Documentation
- Updated setup instructions to use a prebuilt, optimized dataset by default, simplifying the model compression workflow.
Chores
- Updated model compression configurations across multiple examples to use the prebuilt dataset.
- Enhanced dataset preparation to support prebuilt dataset handling for more efficient setup.

The prepare_dataset.py script previously downloaded the full Nemotron-Post-Training-Dataset-v2 (~136GB) to filter it down to a 2.6GB subset. That processed subset is already published as nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2.

- Add PREBUILT_KD_DATASET constant in prepare_dataset.py
- Short-circuit to load the prebuilt dataset directly when dataset_name matches, skipping download and filtering
- Update all 8 example configs to use the prebuilt dataset
- Clarify disk usage in README with note about the raw path

Fixes NVIDIA#1658

Signed-off-by: Sabari07 <sabursd18@gmail.com>

copy-pr-bot · 2026-06-15T10:20:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-15T10:20:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ea096107-78a5-45d5-89e9-72e6ff383d65

📥 Commits

Reviewing files that changed from the base of the PR and between 9f6e8fd and d047e33.

📒 Files selected for processing (10)

examples/puzzletron/README.md
examples/puzzletron/configs/gptoss-20b_remove_experts_memory/gptoss-20b_remove_experts_memory.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_memory/llama-3_1-8B_pruneffn_memory.yaml
examples/puzzletron/configs/llama-3_1-8B_pruneffn_runtime/llama-3_1-8B_pruneffn_runtime.yaml
examples/puzzletron/configs/llama-3_2-3B_pruneffn_memory/llama-3_2-3B_pruneffn_memory.yaml
examples/puzzletron/configs/mistral-small-24b-instruct-2501_pruneffn_memory/mistral-small-24b-instruct-2501_pruneffn_memory.yaml
examples/puzzletron/configs/nemotron-nano-12b-v2/nemotron_nano_12b_v2_pruneffn_memory.yaml
examples/puzzletron/configs/qwen2_5_7b_instruct_pruneffn_memory/qwen2_5_7b_instruct_pruneffn_memory.yaml
examples/puzzletron/configs/qwen3-8b_pruneffn_memory/qwen3_8b_pruneffn_memory.yaml
modelopt/torch/puzzletron/dataset/prepare_dataset.py

📝 Walkthrough


Switches Puzzletron's default dataset from the raw 136GB Nemotron-Post-Training-Dataset-v2 to the prebuilt ~3GB Puzzle-KD-Nemotron-Post-Training-Dataset-v2. Adds a PREBUILT_KD_DATASET constant and an early-return branch in process_and_save_dataset that loads and saves the prebuilt dataset directly without filtering or splitting. Updates dataset_path in all example YAML configs and rewrites the README preparation instructions accordingly.

Changes

Puzzletron Puzzle-KD Dataset Adoption

Layer / File(s)	Summary
`PREBUILT_KD_DATASET` constant and early-return logic `modelopt/torch/puzzletron/dataset/prepare_dataset.py`	Adds `PREBUILT_KD_DATASET` module-level constant and an early-return branch in `process_and_save_dataset` that detects the prebuilt dataset name, loads it as-is, saves to `output_dir`, and skips the filtering and deterministic train/valid split pipeline.
YAML config `dataset_path` updates and README docs `examples/puzzletron/README.md`, `examples/puzzletron/configs/*/...yaml`	Updates `dataset_path` from `Nemotron-Post-Training-Dataset-v2` to `Puzzle-KD-Nemotron-Post-Training-Dataset-v2` in all eight example configs. Rewrites the README dataset preparation section to default to the prebuilt Puzzle-KD dataset and adds a note describing the optional raw-dataset alternative via `--dataset_name`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically identifies the main change: using a prebuilt KD dataset to eliminate a 136GB download requirement.
Linked Issues check	✅ Passed	The PR fully addresses both objectives from issue `#1658`: it provides documentation clarification and reduces disk requirement from 136GB to ~3GB by defaulting to the prebuilt dataset.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to resolving issue `#1658`: README updates, configuration updates to use prebuilt dataset, and dataset preparation logic modifications.
Security Anti-Patterns	✅ Passed	No security anti-patterns found. PR adds safe dataset-loading logic with no torch.load/numpy.load/eval/exec issues, hardcoded secrets, nosec comments, or new dependencies.


kevalmorabia97 · 2026-06-15T11:17:38Z

/ok to test d047e33

kevalmorabia97 · 2026-06-15T11:18:15Z

@TheSabari07 have you verified that the output dataset from the Puzzle-KD dataset is in a similar format to the one produced from the previous Nemotron-Post-Training-Dataset-v2?

codecov · 2026-06-15T11:26:04Z

Codecov Report

❌ Patch coverage is 12.50000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.55%. Comparing base (9f6e8fd) to head (d047e33).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
...delopt/torch/puzzletron/dataset/prepare_dataset.py	12.50%	7 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1726      +/-   ##
==========================================
- Coverage   77.12%   76.55%   -0.58%     
==========================================
  Files         511      511              
  Lines       56247    56255       +8     
==========================================
- Hits        43381    43064     -317     
- Misses      12866    13191     +325

Flag	Coverage Δ
examples	`41.84% <12.50%> (-0.11%)`	⬇️
gpu	`57.77% <12.50%> (-0.60%)`	⬇️
regression	`14.69% <0.00%> (+0.06%)`	⬆️
unit	`54.38% <12.50%> (-0.02%)`	⬇️


TheSabari07 · 2026-06-15T13:01:35Z

> @TheSabari07 have you verified that the output dataset from the Puzzle-KD dataset is in a similar format to the one produced from the previous Nemotron-Post-Training-Dataset-v2?

Hi @kevalmorabia97,

Yes, I verified the prebuilt Puzzle-KD dataset. It has the expected train and validation splits and includes all required columns (uuid, license, generator, version, category, reasoning, and messages).

It matches the expected structure, so it should work with the existing Puzzletron flow.
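
A check along the lines described in this reply could look like the sketch below. It is only an illustration of the idea, not the verification that was actually run: the helper name `find_schema_problems` is hypothetical, and the default split names ("train", "validation") are an assumption based on the wording above, so they may need adjusting to the names used on disk. The input is a mapping from split name to column names, which is the shape Hugging Face's `DatasetDict.column_names` returns.

```python
# Columns listed in the reply above as required by the Puzzletron flow.
REQUIRED_COLUMNS = {
    "uuid", "license", "generator", "version",
    "category", "reasoning", "messages",
}


def find_schema_problems(column_names, expected_splits=("train", "validation")):
    """Return a list of human-readable problems; an empty list means the schema looks fine.

    column_names: mapping of split name -> list of column names.
    expected_splits: assumed split names (hypothetical default, adjust as needed).
    """
    problems = []
    for split in expected_splits:
        if split not in column_names:
            problems.append(f"missing split: {split}")
            continue
        missing = REQUIRED_COLUMNS - set(column_names[split])
        if missing:
            problems.append(f"{split} is missing columns: {sorted(missing)}")
    return problems
```

With a dataset loaded from the saved output directory, passing its `column_names` mapping to this helper would give a quick pass/fail signal before running the rest of the flow.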

TheSabari07 · 2026-06-15T13:59:30Z

@danielkorzekwa @kevalmorabia97

Thank you for giving me the opportunity to contribute. I appreciate your review and support, and I'm happy to be a part of improving the project. Looking forward to contributing more.

#1558 #1670 #1662 #1677 #1327 #1673 #1676 #1687 #1678 #1691 #1697 #1702 #1704 #1726 #1729 (#1734)

## Cherry-picked PRs

- #1648
- #1650
- #1594
- #1269
- #1326
- #1652
- #1651
- #1601
- #1653
- #1558
- #1670
- #1662
- #1677
- #1327
- #1673
- #1676
- #1687
- #1678
- #1691
- #1697
- #1702
- #1704
- #1726
- #1729

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added Alpamayo quantization example with FP8/NVFP4 export support.
  * Introduced FastGen DMD2 distillation library for Qwen-Image text-to-image optimization.
  * Added lossless MXFP4-to-NVFP4 weight casting for DeepSeek models.
  * Expanded PTQ recipes with new NVFP4 variants (MLP-only, experts-only, weight-only).
  * Enhanced sparse attention calibration and export capabilities.
* **Documentation**
  * Added end-to-end Nemotron-3 optimization tutorial and comprehensive PTQ recipe guide.
  * Updated example READMEs and CHANGELOG with latest optimization capabilities.
* **Bug Fixes**
  * Fixed sparse attention configuration export schema.
  * Improved KV cache reuse settings for context logits generation.

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Gwenaelle Cunha Sergio <gcunhasergio@nvidia.com>
Signed-off-by: Sabari07 <sabursd18@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Rohan Joshi <rohjoshi@nvidia.com>
Co-authored-by: jingyu-ml <108295447+jingyu-ml@users.noreply.github.com>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Chenhan D. Yu <5185878+ChenhanYu@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: kinjalpatel27 <31936134+kinjalpatel27@users.noreply.github.com>
Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Gwena Cunha <4861122+gcunhase@users.noreply.github.com>
Co-authored-by: Sabari07 <sabursd18@gmail.com>
Co-authored-by: Sepehr Sameni <ssameni@nvidia.com>

TheSabari07 requested a review from a team as a code owner June 15, 2026 10:19

coderabbitai Bot approved these changes Jun 15, 2026


TheSabari07 mentioned this pull request Jun 15, 2026

140GB is needed on hard disk instead of 2.62GB for downloading a dataset for puzzletron algorithm #1658

Closed

kevalmorabia97 approved these changes Jun 15, 2026


kevalmorabia97 merged commit 95d4e12 into NVIDIA:main Jun 15, 2026
66 of 69 checks passed

kevalmorabia97 added the cherry-pick-0.45.0 label ("After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc") on Jun 15, 2026

kevalmorabia97 mentioned this pull request Jun 15, 2026

[Cherry-pick] PRs #1648 #1650 #1594 #1269 #1326 #1652 #1651 #1601 #1653 #1558 #1670 #1662 #1677 #1327 #1673 #1676 #1687 #1678 #1691 #1697 #1702 #1704 #1726 #1729 #1734

Merged

kevalmorabia97 added the cherry-pick-done label ("Added by bot once PR is cherry-picked to the release branch") on Jun 15, 2026

kevalmorabia97 merged 1 commit into NVIDIA:main from TheSabari07:fix/puzzletron-dataset-download

2 participants
