Add Lightning Model Registry fallback for gated HF tokenizer tests #2262
Draft
bhimrazy wants to merge 9 commits into
…epos

Fork PRs run under `pull_request` and have no `HF_TOKEN`, so tokenizer parity tests against gated-but-public HF repos (Llama, Gemma, etc.) fail there. Add a CI-safe resolver that downloads tokenizer/config files from Hugging Face, falls back to a public Lightning Model Registry mirror when a repo is gated and no token is available (the `lightning_sdk` client uses an anonymous guest login), and skips with a clear message when neither source works. Internal/main runs keep full HF coverage via `HF_TOKEN`.

- `tests/_fixtures.py`: resolver + gated repo -> registry fixture map
- `tests/test_tokenizer.py`: use the resolver instead of direct HF downloads
- `tests/upload_gated_tokenizer_fixtures.py`: maintainer-only mirror upload script
- `pyproject.toml`: add `litmodels` to the test extra
Use the verified `lightning-ai/oss-litgpt` teamspace for tokenizer/config fixtures, and rewrite the maintainer-only mirroring script as tests/publish_fixtures.py, parallelizing repo uploads with litdata.map.
9d3ba91 to 6c9295d
for more information, see https://pre-commit.ci
Adds [fixtures] markers for HF vs registry-fallback resolution and runs the scoped tokenizer job with -rs/-s so CI shows which path each repo took and why any skip happens. Temporary, paired with the test_tokenizer.py scoping.
What does this PR do?

Fork PR CI currently breaks on one test. `test_tokenizer_against_hf` downloads tokenizer/config files for every supported model directly from Hugging Face. About 40 of those repos are gated behind license acceptance (Llama, Gemma, Falcon-180B, Mistral-Large, ...). Since #2253 removed `pull_request_target`, fork PRs have no `HF_TOKEN`, so a single gated repo fails the whole CPU job.

This PR adds a fallback so the test resolves tokenizer/config files in this order: download from Hugging Face first; if the repo is gated and no token is available, fall back to a public Lightning Model Registry mirror; if neither source works, skip with a clear message.

`tests/_fixtures.py` implements this fallback and is used by `test_tokenizer.py`. `tests/publish_fixtures.py` is a maintainer-only script that mirrors tokenizer/config files (never weights) for the gated repos to the `lightning-ai/oss-litgpt` registry teamspace.

A spot-check of one published fixture (`meta-llama/Llama-2-7b-hf`) confirms it resolves anonymously via the registry, with no `HF_TOKEN` or Lightning login needed.

Internal PRs and `main` runs are unchanged: they keep using `HF_TOKEN` against Hugging Face directly.

No breaking changes.
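For reviewers who want a mental model of the resolution order, here is a minimal sketch. It is not the code in `tests/_fixtures.py`: `GatedRepoError`, `Resolution`, `resolve`, the injected downloader callables, and the fixture name after the `lightning-ai/oss-litgpt` teamspace are all hypothetical names chosen for illustration. The downloaders are passed in so the logic can be exercised without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class GatedRepoError(Exception):
    """Hypothetical stand-in for the error a hub client raises on a gated repo."""


@dataclass
class Resolution:
    source: str          # "hf", "registry", or "skip"
    path: Optional[str]  # local directory holding the files, None when skipped
    reason: str = ""     # explanation, intended for the skip message


# Illustrative map from gated HF repo to registry fixture. The teamspace comes
# from the PR description; the fixture name after it is an assumption.
FIXTURE_MAP: Dict[str, str] = {
    "meta-llama/Llama-2-7b-hf": "lightning-ai/oss-litgpt/llama-2-7b-hf",
}

Downloader = Callable[[str], str]


def resolve(
    repo_id: str,
    hf_download: Downloader,
    registry_download: Downloader,
    fixture_map: Dict[str, str] = FIXTURE_MAP,
) -> Resolution:
    # 1. Prefer Hugging Face: works for open repos or when a token grants access.
    try:
        return Resolution("hf", hf_download(repo_id))
    except GatedRepoError:
        pass
    # 2. Gated without access: look for a public mirror of this repo.
    fixture = fixture_map.get(repo_id)
    if fixture is None:
        return Resolution("skip", None, f"{repo_id} is gated and has no registry fixture")
    try:
        return Resolution("registry", registry_download(fixture))
    except Exception as err:
        # 3. Neither source worked: report why so the caller can skip the test.
        return Resolution("skip", None, f"{repo_id}: registry fallback failed ({err})")


def fake_hf(repo_id: str) -> str:
    # Pretend every meta-llama repo is gated and no token is configured.
    if repo_id.startswith("meta-llama/"):
        raise GatedRepoError(repo_id)
    return f"/cache/hf/{repo_id}"


def fake_registry(name: str) -> str:
    return f"/cache/registry/{name}"


for repo in ("EleutherAI/pythia-14m", "meta-llama/Llama-2-7b-hf", "meta-llama/unmirrored"):
    result = resolve(repo, fake_hf, fake_registry)
    print(repo, "->", result.source, result.reason)
```

In a test, a `skip` result would typically be turned into `pytest.skip(result.reason)`, so the CI log states which repo was skipped and why.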
Before submitting
`pull_request_target` trigger from CPU tests workflow #2253, which flagged the need for a better solution than dropping fork secrets.
`test_tokenizer.py`.
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist