fix(puzzletron): use prebuilt KD dataset to avoid 136GB download #1726
The prepare_dataset.py script previously downloaded the full Nemotron-Post-Training-Dataset-v2 (~136GB) to filter it down to a 2.6GB subset. That processed subset is already published as nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2.

- Add PREBUILT_KD_DATASET constant in prepare_dataset.py
- Short-circuit to load the prebuilt dataset directly when dataset_name matches, skipping download and filtering
- Update all 8 example configs to use the prebuilt dataset
- Clarify disk usage in README with note about the raw path

Fixes NVIDIA#1658

Signed-off-by: Sabari07 <sabursd18@gmail.com>
/ok to test d047e33
@TheSabari07 have you verified that the output from the Puzzle-KD dataset is in a similar format to the output from the previous Nemotron-Post-Training-Dataset-v2?
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1726 +/- ##
==========================================
- Coverage 77.12% 76.55% -0.58%
==========================================
Files 511 511
Lines 56247 56255 +8
==========================================
- Hits 43381 43064 -317
- Misses 12866 13191 +325
Hi @kevalmorabia97, Yes, I verified the prebuilt Puzzle-KD dataset. It has the expected train and validation splits and includes all required columns (uuid, license, generator, version, category, reasoning, and messages). It matches the expected structure, so it should work with the existing Puzzletron flow.
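For anyone who wants to repeat this check, here is a minimal sketch of how the split and column verification could be scripted. The split and column names come from the comment above; the helper names `missing_fields` and `check_hub_dataset` are hypothetical and are not part of the PR. It assumes the Hugging Face `datasets` package is installed.

```python
# Splits and columns listed in the verification comment above.
REQUIRED_SPLITS = ("train", "validation")
REQUIRED_COLUMNS = (
    "uuid", "license", "generator", "version",
    "category", "reasoning", "messages",
)


def missing_fields(column_names_by_split: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return a mapping of split name to missing columns (empty if all is well)."""
    problems: dict[str, list[str]] = {}
    for split in REQUIRED_SPLITS:
        if split not in column_names_by_split:
            problems[split] = ["<split missing>"]
            continue
        present = column_names_by_split[split]
        missing = [col for col in REQUIRED_COLUMNS if col not in present]
        if missing:
            problems[split] = missing
    return problems


def check_hub_dataset(name: str) -> dict[str, list[str]]:
    """Hypothetical helper: load a dataset from the Hub and check its schema."""
    # Imported lazily so the pure check above works without `datasets`.
    from datasets import load_dataset

    dataset_dict = load_dataset(name)
    # DatasetDict.column_names maps each split to its list of columns.
    return missing_fields(dataset_dict.column_names)
```

Calling `check_hub_dataset("nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2")` should return an empty dict if the dataset has the structure described above. This only checks names, not that the `messages` content matches what the old filtering pipeline produced.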
@danielkorzekwa @kevalmorabia97 Thank you for giving me the opportunity to contribute. I appreciate your review and support, and I'm happy to be a part of improving the project. Looking forward to contributing more. |
#1558 #1670 #1662 #1677 #1327 #1673 #1676 #1687 #1678 #1691 #1697 #1702 #1704 #1726 #1729 (#1734)

## Cherry-picked PRs

- #1648
- #1650
- #1594
- #1269
- #1326
- #1652
- #1651
- #1601
- #1653
- #1558
- #1670
- #1662
- #1677
- #1327
- #1673
- #1676
- #1687
- #1678
- #1691
- #1697
- #1702
- #1704
- #1726
- #1729

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * Added Alpamayo quantization example with FP8/NVFP4 export support.
  * Introduced FastGen DMD2 distillation library for Qwen-Image text-to-image optimization.
  * Added lossless MXFP4-to-NVFP4 weight casting for DeepSeek models.
  * Expanded PTQ recipes with new NVFP4 variants (MLP-only, experts-only, weight-only).
  * Enhanced sparse attention calibration and export capabilities.
* **Documentation**
  * Added end-to-end Nemotron-3 optimization tutorial and comprehensive PTQ recipe guide.
  * Updated example READMEs and CHANGELOG with latest optimization capabilities.
* **Bug Fixes**
  * Fixed sparse attention configuration export schema.
  * Improved KV cache reuse settings for context logits generation.

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Rohan Joshi <rohjoshi@nvidia.com>
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Gwenaelle Cunha Sergio <gcunhasergio@nvidia.com>
Signed-off-by: Sabari07 <sabursd18@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Rohan Joshi <rohjoshi@nvidia.com>
Co-authored-by: jingyu-ml <108295447+jingyu-ml@users.noreply.github.com>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Chenhan D. Yu <5185878+ChenhanYu@users.noreply.github.com>
Co-authored-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: kinjalpatel27 <31936134+kinjalpatel27@users.noreply.github.com>
Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Gwena Cunha <4861122+gcunhase@users.noreply.github.com>
Co-authored-by: Sabari07 <sabursd18@gmail.com>
Co-authored-by: Sepehr Sameni <ssameni@nvidia.com>
Fixes #1658
What does this PR do?
Type of change: Bug fix, documentation
This PR updates the Puzzletron dataset preparation flow to use the already published
prebuilt dataset nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2 by default,
avoiding the need to download the full raw nvidia/Nemotron-Post-Training-Dataset-v2
dataset (~136 GB) just to filter it down to the same ~2.6 GB result.
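Changes included:
- Add PREBUILT_KD_DATASET constant in prepare_dataset.py
- Short-circuit when dataset_name matches the prebuilt dataset, loading it directly and skipping the download + filtering pipeline
- Update all 8 example configs to use the prebuilt dataset
- Clarify disk usage in the README, noting that the raw ~136 GB path is still available if users want to reproduce preprocessing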
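The short-circuit described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code in prepare_dataset.py: the constant name and dataset ID come from this PR, while the function signature and the two injected loaders (`load_prebuilt`, `download_and_filter`) are hypothetical stand-ins for whatever the script really calls.

```python
from typing import Callable

# Constant name and value as described in this PR.
PREBUILT_KD_DATASET = "nvidia/Puzzle-KD-Nemotron-Post-Training-Dataset-v2"


def prepare_dataset(
    dataset_name: str,
    load_prebuilt: Callable[[str], object],
    download_and_filter: Callable[[str], object],
) -> object:
    """Return the KD dataset, skipping the raw pipeline when possible.

    Both loaders are hypothetical placeholders passed in by the caller.
    """
    if dataset_name == PREBUILT_KD_DATASET:
        # Lightweight path: the filtered subset is already published,
        # so there is nothing to download in full or to filter.
        return load_prebuilt(dataset_name)
    # Raw path: existing behavior, still available for reproducing
    # the preprocessing from the full dataset.
    return download_and_filter(dataset_name)
```

Matching on the exact dataset name keeps the raw workflow untouched: any other `dataset_name` still goes through the original download and filtering steps.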
Usage
Default lightweight path:
Raw dataset path (existing behavior, still supported):
Testing
pre-commit run --all-files
mypy reported unrelated existing errors in:
modelopt/torch/opt/config_loader.py
modelopt/recipe/loader.py
Before your PR is "Ready for review"
CONTRIBUTING.md: N/A
Additional Information
This change preserves the original raw-dataset workflow for users who explicitly want
to regenerate the filtered dataset from scratch, while making the default example flow
much lighter and easier to use.