feat: video + audio understanding GRPO training recipe #2823
Open
yuekaizhang wants to merge 20 commits into
Open
feat: video + audio understanding GRPO training recipe #2823yuekaizhang wants to merge 20 commits into
yuekaizhang wants to merge 20 commits into
Conversation
- vlm_hf_data_processor accepts the daily-omni task and a `video` content type; loads frames via `transformers.video_utils.load_video` and emits `vllm_videos` alongside images/audios - eval_collate_fn forwards `vllm_videos` and `_run_env_eval_impl` attaches them to vLLM `multi_modal_data["video"]` - DailyOmniEvalDataset wraps DailyOmniDataset for the eval registry; strips the upstream "single letter only" instruction at the eval boundary so the prompt template alone dictates output format (SFT path untouched) - examples/prompts/daily_omni.txt + examples/configs/evals/daily_omni.yaml drive Qwen/Qwen2.5-Omni-3B inference with `<answer>` formatting - new `exact_alnum_with_fallback` reward mirrors HumanOmniV2 semantics (whole response treated as the answer when the tag is missing); the existing `exact_alnum` strict reward is unchanged Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Adds a NeMo-RL GRPO recipe that fine-tunes Qwen/Qwen2.5-Omni-3B on the
HumanOmniV2 PhilipC/IntentTrain audio-visual intent-recognition dataset
and validates on PhilipC/IntentBench. Each prompt feeds the Qwen2.5-Omni
processor both the video stream (16 frames) and the audio track decoded
from the same file at 16 kHz mono, with use_audio_in_video=True propagated
through apply_chat_template and through vLLM rollout's mm_processor_kwargs
so audio and video tokens are aligned.
New IntentDataset class downloads each HF repo via snapshot_download,
extracts videos.zip once (sentinel-guarded), filters manifests to
problem_type == "multiple choice", and emits messages with video path +
decord-decoded audio array + text prompt. Two registry entries
(intent-train, intent-bench) share the implementation.
Framework wiring:
- nemo_rl/data/processors.py: pass use_audio_in_video=True through
apply_chat_template for intent tasks.
- nemo_rl/models/generation/vllm/utils.py: add multi_modal_data["video"]
forwarding alongside the existing image/audio keys, and set
mm_processor_kwargs={"use_audio_in_video": True} for intent rollouts
that carry both modalities.
- nemo_rl/evals/eval.py: same mm_processor_kwargs injection on the eval
side.
- nemo_rl/data/collate_fn.py: eval_collate_fn now propagates task_name so
the eval prompt builder can detect intent samples.
Recipe ships with examples/configs/intent_grpo_3B_megatron.yaml (mirrors
audio_grpo_3B_megatron.yaml: limit_mm_per_prompt {video:1, audio:1};
num_frames:16; max_total_sequence_length:8192; sequence_packing off;
mm_processor_cache_gb:0; format(0.2)+exact_alnum(0.8)) plus a docs guide
docs/guides/grpo-intent.md linked from docs/index.md.
Smoke validation (real GPU run) is the next round's task.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Qwen2.5-Omni's apply_chat_template path silently swallows
use_audio_in_video=True: when both an explicit {type:audio} content item
and a {type:video} item are present in the message, the chat template
emits two audio placeholders but the processor's audio_lengths iterator
sees only one entry and raises StopIteration in
replace_multimodal_special_tokens.
Switch IntentDataset to emit only video + text content items, and have
vlm_hf_data_processor extract the audio track from the video file inline
(via decord.AudioReader) when the task is intent-train/intent-bench.
For these tasks the processor is invoked manually in two steps:
text = apply_chat_template(tokenize=False)
inputs = processor(text=[text], videos=videos, audio=audios,
use_audio_in_video=True, return_tensors="pt")
This is the path HumanOmniV2 uses and the only path that produces both
audio features (input_features, feature_attention_mask) and video
features (pixel_values_videos, video_grid_thw, video_second_per_grid)
without tripping the duplicate-placeholder StopIteration.
Verified locally on a real IntentTrain sample: token_ids shape 7325
(under the YAML's 8192 budget), loss_multiplier=1.0, vllm_videos and
vllm_audios both populated.
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
vLLM's Qwen2.5-Omni multimodal pipeline asserts when the rendered text
prompt does not contain audio placeholder tokens but mm_items["audio"]
is provided -- "Failed to apply prompt replacement for mm_items['audio'][0]"
inside the rollout. The rendered text we put into vllm_content comes from
processor.apply_chat_template(tokenize=False, add_generation_prompt=True),
which only emits <|VIDEO|> placeholders for {type:video}; the audio
placeholders are inserted later by Qwen2.5-Omni's custom processor when
text is re-tokenized with use_audio_in_video=True.
Switch the rollout-time prompt format to prompt_token_ids for intent
tasks: the policy-side input_ids was built from the same two-step
processor invocation as vllm_audios/vllm_videos, so it already carries
both audio AND video placeholder tokens at the correct positions.
mm_processor_kwargs is no longer needed because vLLM does not
re-tokenize.
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…udio_in_video)
The two-step processor.apply_chat_template + processor() path with
use_audio_in_video=True works for the policy-side message_log but vLLM's
Qwen2.5-Omni multimodal pipeline rejects it at rollout with
"Failed to apply prompt replacement for mm_items['audio'][0]" because
the rendered prompt text only carries <|VIDEO|> placeholders, not the
audio placeholders that vLLM expects to find before consuming an
mm_items["audio"] entry. Forcing prompt_token_ids did not help: vLLM
still applies its own multimodal prompt replacement.
Switch the IntentTrain/IntentBench data pipeline to feed audio and video
to the chat template as independent {type:audio} / {type:video} content
items. The chat template now renders both <|VIDEO|> and <|AUDIO|>
placeholders into the prompt, the existing single-step
apply_chat_template(tokenize=True, return_dict=True) path produces both
audio features (input_features, feature_attention_mask) and video
features (pixel_values_videos, video_grid_thw, video_second_per_grid),
and vLLM accepts mm_items["audio"] + mm_items["video"] without the
duplicate-placeholder error. The model still receives both modalities;
the only thing dropped is the explicit time-alignment hint, which we
defer to a follow-up since v1's vLLM stack does not support that path
for Qwen2.5-Omni.
Verified locally on a real IntentTrain sample: token_ids len=7325 under
the 8192 max, vllm_videos len=1, vllm_audios len=1, prompt has both
<|VIDEO|> and <|AUDIO|>.
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…treams path
Round 1 ended on the explicit ``{type:audio}`` + ``{type:video}``
multimodal contract (Qwen2.5-Omni's chat template renders <|VIDEO|> +
<|AUDIO|> placeholders independently and vLLM rolls out both modalities
without ``use_audio_in_video=True``). The dataset module docstring,
class docstring, YAML header, and the public docs guide all still
described the abandoned ``use_audio_in_video`` / ``mm_processor_kwargs``
path; rewrite them to match the verified implementation and document
why the alignment hint is intentionally not used in v1.
Also remove the "exercised end to end" claim from the docs guide and
replace it with the actual smoke configuration plus the
``HF_HUB_OFFLINE=1`` requirement that surfaced when the Megatron
tokenizer worker hit a network read timeout.
Add regression coverage for the contract before closing the round:
- tests/unit/models/generation/test_vllm_utils.py:
``test_vllm_utils_vlm_with_audio_and_video_intent_path`` builds a
BatchedDataDict with ``vllm_videos`` AND ``vllm_audios`` plus
``task_name=["intent-train", "intent-bench"]`` and asserts
``multi_modal_data`` carries both ``video`` and ``audio`` keys for
every prompt and that ``mm_processor_kwargs`` is NOT set.
- tests/unit/data/datasets/test_intent_dataset.py: fabricates a fake
IntentTrain HF snapshot (manifest + .mp4 with audio + sentinel),
monkeypatches ``snapshot_download`` and ``get_huggingface_cache_path``,
and asserts every yielded sample emits exactly one type=video, one
type=audio (np.float32 1-D array), and one type=text content item.
``free-form`` samples are dropped by the allow-list.
Both new tests fail if a future change reverts the contract (drops
``vllm_audios`` from format_prompt_for_vllm_generation, or restores the
``use_audio_in_video=True`` / single-stream path).
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Replace the stale "joint via use_audio_in_video=True" cell in the
hyperparameter table with the actual independent <|VIDEO|> / <|AUDIO|>
placeholder shape that landed in code. Replace the previous overclaiming
Results section with the exact smoke command Round 3 executed end to end
on 4 x H100 80GB:
- HF_HUB_OFFLINE=1 / TRANSFORMERS_OFFLINE=1 to avoid the Megatron
tokenizer worker's AutoTokenizer.from_pretrained timeout.
- policy.tokenizer.video.num_frames=4 + max_total_sequence_length=4096
so the multimodal forward fits inside the GPU budget vLLM leaves
resident.
- policy.megatron_cfg.activation_checkpointing=true to drop activation
memory below the same budget.
- policy.generation.vllm_cfg.gpu_memory_utilization=0.5 so vLLM's KV
cache does not crowd Megatron training out of the GPU.
- policy.logprob_batch_size=1 to match the per-DP-rank slice when
train_global_batch_size=4 on 4 GPUs (the YAML default of 4 trips
"Data dict size (1) is not a multiple of microbatch size").
- cluster.gpus_per_node=4 so global_batch_size=4 satisfies Megatron's
divisibility assertion against data-parallel size.
Document the runtime evidence the run produced:
- val_at_start validation reached the IntentBench loader and reported
"Accuracy: 0.0000, response length 2.0 tokens, 4 samples processed".
- Steps 1 and 2 trained and saved checkpoints under
results/intent_grpo_3B_megatron/step_{1,2}/policy/weights.
- convert_megatron_to_hf.py (with --extra mcore) wrote
results/intent_grpo_3B_megatron/step_2/hf/{config.json,
model-*.safetensors, model.safetensors.index.json,
chat_template.jinja, generation_config.json} and printed
"All tensors from the original checkpoint were written."
- format_prompt_for_vllm_generation produces multi_modal_data keys
['audio','video'] with video=(N,H,W,3) ndarray and audio=(np_array,
16000), with mm_processor_kwargs absent, and the rendered prompt
contains both <|VIDEO|> and <|AUDIO|> placeholders.
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
VLMVerifyWorker previously collapsed the configured reward functions into
a single combined scalar via combine_reward_functions, so the validation
loop in nemo_rl/algorithms/grpo.py::validate could only log a single
"accuracy" number and the per-sample rewards JSONL stored only the
combined value. The grpo-intent guide promised "validation reward and
answer-correctness reward signal" in wandb / tensorboard but those
component metrics were never actually emitted.
Restructure so the per-function scores survive end to end:
- nemo_rl/environments/vlm_environment.py: factor reward construction
into _build_named_reward_functions returning (name, fn, weight)
triples. VLMVerifyWorker keeps a verify() shim for back-compat (still
returning the combined scalar) and adds verify_with_components which
returns both the combined list AND a per-sample list of weighted
per-function scores in a stable order. VLMEnvironment.step now calls
verify_with_components and returns rewards as an (N, K) tensor of
weighted components -- summing along dim=1 reproduces the historical
scalar total_reward GRPO uses for advantage computation, and the
rollout's existing multi-reward path (run_multi_turn_rollout) already
promotes the K columns to reward1, reward2, ... batch keys when K>1.
Adds a Ray-callable reward_component_names() accessor so callers can
map column index back to the configured name.
- nemo_rl/algorithms/grpo.py::validate: when reward<i> columns are
present on the val batch, fetch the human names via the env's
reward_component_names accessor, accumulate per-component sample
values across val batches, compute means, surface them as
reward/<name> entries in val_metrics, print them in the validation
summary block, and write per-sample component values into the
val_data_step{N}.jsonl artifact alongside the combined rewards.
For envs that still return a 1-D (N,) rewards tensor (math, retriever,
sliding_puzzle, code_jaccard, reward_model, nemo_gym, ...) nothing
changes: the rollout single-reward branch keeps total_reward as before
and the per-component plumbing stays inert.
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…metrics
Round 4 implemented the per-component reward emission in the VLM env +
GRPO validate (commit b49177f6). The grpo-intent guide's Results
section previously promised "validation reward and answer-correctness
reward signal" but did not enumerate the concrete metric keys; replace
that wording with the exact artifacts the validation loop now writes:
- Stdout block: Accuracy, Average response length, Samples processed,
and a "Per-component reward (weighted):" sub-block listing one line
per configured reward function (e.g. format, exact_alnum). Each
listed value is weighted, so summing the components reproduces the
combined reward.
- val_metrics handed off to wandb / tensorboard: accuracy, avg_length,
and reward/<name> entries (e.g. reward/format, reward/exact_alnum).
- val_data_step{N}.jsonl: each row gets the existing content / idx /
rewards columns plus one new column per reward component
(reward/format, reward/exact_alnum) carrying per-sample weighted
scores so eyeballing or plotting can split format vs answer-
correctness signal directly.
Also documents the implementation seam so future readers know that
the (N, K) rewards tensor from the env flows through the rollout's
existing multi-reward path and that VLMEnvironment.reward_component_names()
is the Ray-callable accessor used by the validation loop.
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…ootstrap GRPO
Initial training showed both validation rewards stuck at 0 because the
Qwen2.5-Omni-3B base emits a bare letter (e.g. "B") instead of
"<answer>B</answer>", so neither reward function ever fires:
* format_reward only awards points for <think> and <answer> tags.
* exact_alnum_reward extracts the content of <answer></answer> and
returns 0 if the tag is missing entirely.
Two fixes:
1. Strengthen the per-problem-type instruction in IntentDataset so the
prompt explicitly asks the model to first reason between <think> </think>
tags and then commit the final answer between <answer> </answer> tags,
with a concrete format example. Now the same prompt that previously
said "Please provide only the single option letter ... within the
<answer> </answer> tags." also says "First reason briefly between
<think> </think> tags, then output ... <answer> </answer> tags. Format
example: <think>your reasoning</think><answer>A</answer>".
2. Switch the YAML reward set from "format(0.2)+exact_alnum(0.8)" to
"format(0.1)+exact_alnum_with_fallback(0.9)". The "with_fallback"
variant treats the entire response as the answer when the
<answer> </answer> tag is missing, so the model gets credit for
bare-letter answers too. format_reward stays in the mix at low
weight to nudge the policy toward the wrapped form so it eventually
emits <think>/<answer> structure that GRPO can rely on.
Together these unblock GRPO from the all-zero-reward starting point.
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
… fixes
Daily-Omni eval (was video-only on an audio-visual benchmark):
- DailyOmniDataset.format_data now emits an independent {type:audio} item
(16 kHz mono from the sibling *_audio.wav) alongside video so the
Qwen2.5-Omni chat template renders <|AUDIO|> and vLLM populates
multi_modal_data["audio"]; eval _format_for_eval locates the text item by
type instead of a fixed index.
- register DailyOmniEvalDataConfig in the eval-config union: MasterConfig is
now a pydantic BaseModel (upstream NVIDIA-NeMo#2325) and strictly validates
data.dataset_name, which rejected 'daily-omni' even though the loader
supported it.
- daily_omni.yaml: 32 video frames, audio:1 mm limit, max_model_len 32000,
and gpu_memory_utilization 0.5 + max_num_seqs 8 to avoid the multimodal
encoder activation OOM (vLLM batched ~66 clips into one encoder forward and
hard-crashed the workers; KV cache was <2% used).
- daily_omni prompt switched to the training think+answer template.
Intent recipe:
- intent.py renders multiple-choice options into the prompt (_format_options);
without them the model answered blind.
- 3B/7B megatron configs: full-throughput rollout batch sizes, per-forward
batch of 1 to dodge the Qwen2.5-Omni get_rope_index IndexError, and 7B
num_frames=8 to keep the training-forward activation memory in budget.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Reverts 5e9a818 (feat(grpo): expose per-component reward metrics in VLM validation) and its companion docs 0fd3f48 (docs(grpo-intent): match Results section to per-component validation metrics). 5e9a818 restructured VLMEnvironment to emit an (N,K) per-component reward tensor and added validation-loop logging of reward/<name> components. This was purely added observability — training advantage still used the combined scalar — so reverting changes no training/eval behaviour, only drops the extra per-component val logs. grpo.py returns to its origin/main state. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Remove the exact_answer_alphanumeric_with_fallback_reward function, its VLM-environment registration, and switch the Daily-Omni eval config to the strict exact_alnum reward. With the think+answer prompt the model reliably wraps its answer in <answer> tags, so the no-tag fallback never fires; recomputed strict-vs-fallback scores on the saved decodes are identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
- intent_grpo_7B_megatron.yaml now inherits grpo_math_1B_megatron.yaml directly (the same base the 3B recipe used) and inlines the intent-specific config, so it no longer depends on the 3B recipe. Resolved config is unchanged except checkpoint_dir, which is corrected from the inherited results/intent_grpo_3B_megatron to results/intent_grpo_7B_megatron. - untrack examples/configs/intent_grpo_3B_megatron.yaml (kept on disk). - rename docs/guides/grpo-intent.md -> grpo-audio-visual.md, update docs/index.md links, and rewrite the guide for the 7B recipe (8 frames, TP=2, batch 32/1, logprob_batch_size=1, save_period 20). Replace the stale smoke section with the real Daily-Omni eval flow and a base-vs-After-GRPO results table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
- tests/functional/audio_visual_grpo_megatron.sh: 2-step intent audio+video GRPO on the 7B recipe pinned to the lighter Qwen2.5-Omni-3B, asserting max(train/reward) > 0.6 and mean(train/token_mult_prob_error) < 1.05. Registered in L1_Functional_Tests_Megatron_1.sh (full mode only, not the fast lane, since it pulls the IntentTrain video dataset). - drop the ffmpeg-dependent intent-dataset unit test; fabricating an mp4 with an audio track needs ffmpeg, which the unit suite should not require. The audio+video sample-shape contract is covered by the functional test above and the vLLM-utils unit tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…2 paper - title -> "Audio-Visual GRPO with Qwen2.5-Omni-7B" - intro + index card: evaluate on Daily-Omni (was PhilipC/IntentBench); index card model 3B -> 7B - link HumanOmniV2 to the paper (arxiv 2506.21277) instead of the GitHub repo - drop the use_audio_in_video notes from the intro, the 7B training-notes rope-IndexError bullet, and the AV-Event-Alignment sentence in Results Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Contributor
Author
|
@yuki-97 Hi Yuki, could you review the PR when you have a moment? Many thanks! |
Contributor
Author
|
/ok to test d9ca267 |
ruff-format adds the missing second blank line after _format_options (pre-commit ruff-format hook); no behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Contributor
Author
|
/ok to test cc150ab |
The Daily-Omni eval dataset module is type-clean, so the pyrefly hook requires it on the project-includes whitelist. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Contributor
Author
|
/ok to test 2871d57 |
DailyOmniDataset.format_data now emits [video, audio, text] content items (audio added for the audio-visual recipe), so the content assertions check content[1]==audio and content[2]==text. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Contributor
Author
|
/ok to test e1aa32c |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The PR supports audio visual GRPO training using qwen2.5-omni-7B.
Based on [HumanOmniV2](https://arxiv.org/abs/2506.21277) training set, the recipe could improve qwen-omni's video reasoning perfomance a lot.
Curve
Results on DailyOmni testset:
Daily-Omni accuracy (1197 questions, greedy decoding) for the base Qwen2.5-Omni-7B versus the GRPO-trained checkpoint: