
feat: video + audio understanding GRPO training recipe #2823

Open: yuekaizhang wants to merge 20 commits into NVIDIA-NeMo:main from yuekaizhang:audio_video

Conversation

@yuekaizhang yuekaizhang commented Jun 15, 2026

Contributor

This PR adds audio-visual GRPO training with Qwen2.5-Omni-7B.

Trained on the [HumanOmniV2](https://arxiv.org/abs/2506.21277) training set, the recipe improves Qwen2.5-Omni-7B's video reasoning performance: overall Daily-Omni accuracy rises from 0.498 to 0.590.

Curve: [image]

Results on the Daily-Omni test set:

Daily-Omni accuracy (1197 questions, greedy decoding) for the base Qwen2.5-Omni-7B versus the GRPO-trained checkpoint:

| Question type | Base | After GRPO |
| --- | --- | --- |
| Overall | 0.498 | 0.590 |
| AV Event Alignment | 0.353 | 0.450 |
| Comparative | 0.618 | 0.725 |
| Context understanding | 0.446 | 0.534 |
| Event Sequence | 0.395 | 0.490 |
| Inference | 0.714 | 0.760 |
| Reasoning | 0.651 | 0.766 |
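
As a quick check on the numbers above, the per-category gains can be computed directly from the reported accuracies. This is only a minimal sketch: the `results` dict and the `gains` helper are illustrative names and are not part of the PR.

```python
# Accuracies copied from the results above: (base, after GRPO).
results = {
    "Overall": (0.498, 0.590),
    "AV Event Alignment": (0.353, 0.450),
    "Comparative": (0.618, 0.725),
    "Context understanding": (0.446, 0.534),
    "Event Sequence": (0.395, 0.490),
    "Inference": (0.714, 0.760),
    "Reasoning": (0.651, 0.766),
}


def gains(table):
    """Map each question type to its absolute accuracy gain, rounded to 3 places."""
    return {name: round(after - base, 3) for name, (base, after) in table.items()}


if __name__ == "__main__":
    # Print the categories from largest to smallest gain.
    for name, gain in sorted(gains(results).items(), key=lambda kv: -kv[1]):
        print(f"{name:<22} +{gain:.3f}")
```

Every category improves; the largest absolute gain is on Reasoning (+0.115) and the smallest on Inference (+0.046).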

yuekaizhang and others added 15 commits June 10, 2026 01:38
- vlm_hf_data_processor accepts the daily-omni task and a `video` content
  type; loads frames via `transformers.video_utils.load_video` and emits
  `vllm_videos` alongside images/audios
- eval_collate_fn forwards `vllm_videos` and `_run_env_eval_impl` attaches
  them to vLLM `multi_modal_data["video"]`
- DailyOmniEvalDataset wraps DailyOmniDataset for the eval registry; strips
  the upstream "single letter only" instruction at the eval boundary so the
  prompt template alone dictates output format (SFT path untouched)
- examples/prompts/daily_omni.txt + examples/configs/evals/daily_omni.yaml
  drive Qwen/Qwen2.5-Omni-3B inference with `<answer>` formatting
- new `exact_alnum_with_fallback` reward mirrors HumanOmniV2 semantics
  (whole response treated as the answer when the tag is missing); the
  existing `exact_alnum` strict reward is unchanged

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Adds a NeMo-RL GRPO recipe that fine-tunes Qwen/Qwen2.5-Omni-3B on the
HumanOmniV2 PhilipC/IntentTrain audio-visual intent-recognition dataset
and validates on PhilipC/IntentBench. Each prompt feeds the Qwen2.5-Omni
processor both the video stream (16 frames) and the audio track decoded
from the same file at 16 kHz mono, with use_audio_in_video=True propagated
through apply_chat_template and through vLLM rollout's mm_processor_kwargs
so audio and video tokens are aligned.

New IntentDataset class downloads each HF repo via snapshot_download,
extracts videos.zip once (sentinel-guarded), filters manifests to
problem_type == "multiple choice", and emits messages with video path +
decord-decoded audio array + text prompt. Two registry entries
(intent-train, intent-bench) share the implementation.
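
The sentinel-guarded extraction and the manifest filter described above could look roughly like the sketch below. It is not the IntentDataset code: `extract_once`, `keep_multiple_choice`, and the `.extracted` sentinel file name are hypothetical, and the manifest is assumed to be a list of dicts with a `problem_type` key.

```python
import zipfile
from pathlib import Path


def extract_once(zip_path: Path, out_dir: Path, sentinel_name: str = ".extracted") -> bool:
    """Extract zip_path into out_dir unless a sentinel file marks it as done.

    Returns True if extraction ran, False if it was skipped.
    """
    sentinel = out_dir / sentinel_name
    if sentinel.exists():
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)
    # Written last, so a crash mid-extraction leaves no sentinel and the
    # next call retries instead of trusting a partial directory.
    sentinel.touch()
    return True


def keep_multiple_choice(manifest: list[dict]) -> list[dict]:
    """Keep only rows whose problem_type is exactly 'multiple choice'."""
    return [row for row in manifest if row.get("problem_type") == "multiple choice"]
```

Writing the sentinel after `extractall` returns is the design choice that matters here: the presence of the file then implies a complete extraction.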

Framework wiring:
- nemo_rl/data/processors.py: pass use_audio_in_video=True through
  apply_chat_template for intent tasks.
- nemo_rl/models/generation/vllm/utils.py: add multi_modal_data["video"]
  forwarding alongside the existing image/audio keys, and set
  mm_processor_kwargs={"use_audio_in_video": True} for intent rollouts
  that carry both modalities.
- nemo_rl/evals/eval.py: same mm_processor_kwargs injection on the eval
  side.
- nemo_rl/data/collate_fn.py: eval_collate_fn now propagates task_name so
  the eval prompt builder can detect intent samples.

Recipe ships with examples/configs/intent_grpo_3B_megatron.yaml (mirrors
audio_grpo_3B_megatron.yaml: limit_mm_per_prompt {video:1, audio:1};
num_frames:16; max_total_sequence_length:8192; sequence_packing off;
mm_processor_cache_gb:0; format(0.2)+exact_alnum(0.8)) plus a docs guide
docs/guides/grpo-intent.md linked from docs/index.md.

Smoke validation (real GPU run) is the next round's task.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Qwen2.5-Omni's apply_chat_template path silently swallows
use_audio_in_video=True: when both an explicit {type:audio} content item
and a {type:video} item are present in the message, the chat template
emits two audio placeholders but the processor's audio_lengths iterator
sees only one entry and raises StopIteration in
replace_multimodal_special_tokens.

Switch IntentDataset to emit only video + text content items, and have
vlm_hf_data_processor extract the audio track from the video file inline
(via decord.AudioReader) when the task is intent-train/intent-bench.
For these tasks the processor is invoked manually in two steps:

  text = apply_chat_template(tokenize=False)
  inputs = processor(text=[text], videos=videos, audio=audios,
                     use_audio_in_video=True, return_tensors="pt")

This is the path HumanOmniV2 uses and the only path that produces both
audio features (input_features, feature_attention_mask) and video
features (pixel_values_videos, video_grid_thw, video_second_per_grid)
without tripping the duplicate-placeholder StopIteration.

Verified locally on a real IntentTrain sample: token_ids shape 7325
(under the YAML's 8192 budget), loss_multiplier=1.0, vllm_videos and
vllm_audios both populated.

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
vLLM's Qwen2.5-Omni multimodal pipeline asserts when the rendered text
prompt does not contain audio placeholder tokens but mm_items["audio"]
is provided -- "Failed to apply prompt replacement for mm_items['audio'][0]"
inside the rollout. The rendered text we put into vllm_content comes from
processor.apply_chat_template(tokenize=False, add_generation_prompt=True),
which only emits <|VIDEO|> placeholders for {type:video}; the audio
placeholders are inserted later by Qwen2.5-Omni's custom processor when
text is re-tokenized with use_audio_in_video=True.

Switch the rollout-time prompt format to prompt_token_ids for intent
tasks: the policy-side input_ids was built from the same two-step
processor invocation as vllm_audios/vllm_videos, so it already carries
both audio AND video placeholder tokens at the correct positions.
mm_processor_kwargs is no longer needed because vLLM does not
re-tokenize.

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…udio_in_video)

The two-step processor.apply_chat_template + processor() path with
use_audio_in_video=True works for the policy-side message_log but vLLM's
Qwen2.5-Omni multimodal pipeline rejects it at rollout with
"Failed to apply prompt replacement for mm_items['audio'][0]" because
the rendered prompt text only carries <|VIDEO|> placeholders, not the
audio placeholders that vLLM expects to find before consuming an
mm_items["audio"] entry. Forcing prompt_token_ids did not help: vLLM
still applies its own multimodal prompt replacement.

Switch the IntentTrain/IntentBench data pipeline to feed audio and video
to the chat template as independent {type:audio} / {type:video} content
items. The chat template now renders both <|VIDEO|> and <|AUDIO|>
placeholders into the prompt, the existing single-step
apply_chat_template(tokenize=True, return_dict=True) path produces both
audio features (input_features, feature_attention_mask) and video
features (pixel_values_videos, video_grid_thw, video_second_per_grid),
and vLLM accepts mm_items["audio"] + mm_items["video"] without the
duplicate-placeholder error. The model still receives both modalities;
the only thing dropped is the explicit time-alignment hint, which we
defer to a follow-up since v1's vLLM stack does not support that path
for Qwen2.5-Omni.
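
One cheap way to catch a placeholder mismatch before it reaches the rollout is to count placeholders in the rendered prompt and compare them with the attached items. The sketch below assumes one placeholder per attached item, which matches the independent-streams setup described above; `check_placeholders` is a hypothetical helper, not a NeMo-RL or vLLM function.

```python
VIDEO_PLACEHOLDER = "<|VIDEO|>"
AUDIO_PLACEHOLDER = "<|AUDIO|>"


def check_placeholders(prompt: str, num_videos: int, num_audios: int) -> None:
    """Raise ValueError if placeholder counts do not match the attached items."""
    found = {
        "video": prompt.count(VIDEO_PLACEHOLDER),
        "audio": prompt.count(AUDIO_PLACEHOLDER),
    }
    expected = {"video": num_videos, "audio": num_audios}
    if found != expected:
        raise ValueError(f"placeholder mismatch: found {found}, expected {expected}")
```

A prompt that carries only a video placeholder while an audio item is attached fails this check, which is the situation the earlier rounds ran into.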

Verified locally on a real IntentTrain sample: token_ids len=7325 under
the 8192 max, vllm_videos len=1, vllm_audios len=1, prompt has both
<|VIDEO|> and <|AUDIO|>.

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…treams path

Round 1 ended on the explicit ``{type:audio}`` + ``{type:video}``
multimodal contract (Qwen2.5-Omni's chat template renders <|VIDEO|> +
<|AUDIO|> placeholders independently and vLLM rolls out both modalities
without ``use_audio_in_video=True``). The dataset module docstring,
class docstring, YAML header, and the public docs guide all still
described the abandoned ``use_audio_in_video`` / ``mm_processor_kwargs``
path; rewrite them to match the verified implementation and document
why the alignment hint is intentionally not used in v1.

Also remove the "exercised end to end" claim from the docs guide and
replace it with the actual smoke configuration plus the
``HF_HUB_OFFLINE=1`` requirement that surfaced when the Megatron
tokenizer worker hit a network read timeout.

Add regression coverage for the contract before closing the round:

- tests/unit/models/generation/test_vllm_utils.py:
  ``test_vllm_utils_vlm_with_audio_and_video_intent_path`` builds a
  BatchedDataDict with ``vllm_videos`` AND ``vllm_audios`` plus
  ``task_name=["intent-train", "intent-bench"]`` and asserts
  ``multi_modal_data`` carries both ``video`` and ``audio`` keys for
  every prompt and that ``mm_processor_kwargs`` is NOT set.

- tests/unit/data/datasets/test_intent_dataset.py: fabricates a fake
  IntentTrain HF snapshot (manifest + .mp4 with audio + sentinel),
  monkeypatches ``snapshot_download`` and ``get_huggingface_cache_path``,
  and asserts every yielded sample emits exactly one type=video, one
  type=audio (np.float32 1-D array), and one type=text content item.
  ``free-form`` samples are dropped by the allow-list.

Both new tests fail if a future change reverts the contract (drops
``vllm_audios`` from format_prompt_for_vllm_generation, or restores the
``use_audio_in_video=True`` / single-stream path).

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Replace the stale "joint via use_audio_in_video=True" cell in the
hyperparameter table with the actual independent <|VIDEO|> / <|AUDIO|>
placeholder shape that landed in code. Replace the previous overclaiming
Results section with the exact smoke command Round 3 executed end to end
on 4 x H100 80GB:

  - HF_HUB_OFFLINE=1 / TRANSFORMERS_OFFLINE=1 to avoid the Megatron
    tokenizer worker's AutoTokenizer.from_pretrained timeout.
  - policy.tokenizer.video.num_frames=4 + max_total_sequence_length=4096
    so the multimodal forward fits inside the GPU budget vLLM leaves
    resident.
  - policy.megatron_cfg.activation_checkpointing=true to drop activation
    memory below the same budget.
  - policy.generation.vllm_cfg.gpu_memory_utilization=0.5 so vLLM's KV
    cache does not crowd Megatron training out of the GPU.
  - policy.logprob_batch_size=1 to match the per-DP-rank slice when
    train_global_batch_size=4 on 4 GPUs (the YAML default of 4 trips
    "Data dict size (1) is not a multiple of microbatch size").
  - cluster.gpus_per_node=4 so global_batch_size=4 satisfies Megatron's
    divisibility assertion against data-parallel size.
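
The two batch-size constraints in the list above reduce to simple divisibility arithmetic. The sketch below assumes the data-parallel size equals the GPU count (4) in the smoke run; the function names and error wording are illustrative and do not reproduce NeMo-RL or Megatron messages.

```python
def per_rank_batch(global_batch_size: int, dp_size: int) -> int:
    """Number of samples each data-parallel rank receives per step."""
    if global_batch_size % dp_size != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible by "
            f"data-parallel size {dp_size}"
        )
    return global_batch_size // dp_size


def check_micro_batch(rank_batch: int, micro_batch_size: int) -> None:
    """A rank's slice must split evenly into micro-batches."""
    if rank_batch % micro_batch_size != 0:
        raise ValueError(
            f"per-rank slice of {rank_batch} sample(s) is not a multiple of "
            f"micro-batch size {micro_batch_size}"
        )


if __name__ == "__main__":
    rank_batch = per_rank_batch(global_batch_size=4, dp_size=4)  # 4 // 4 == 1
    check_micro_batch(rank_batch, micro_batch_size=1)  # passes
    try:
        check_micro_batch(rank_batch, micro_batch_size=4)  # the YAML default
    except ValueError as err:
        print(err)
```

With a global batch of 4 spread over 4 ranks, each rank holds a single sample, so only a micro-batch size of 1 divides it evenly.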

Document the runtime evidence the run produced:
  - val_at_start validation reached the IntentBench loader and reported
    "Accuracy: 0.0000, response length 2.0 tokens, 4 samples processed".
  - Steps 1 and 2 trained and saved checkpoints under
    results/intent_grpo_3B_megatron/step_{1,2}/policy/weights.
  - convert_megatron_to_hf.py (with --extra mcore) wrote
    results/intent_grpo_3B_megatron/step_2/hf/{config.json,
    model-*.safetensors, model.safetensors.index.json,
    chat_template.jinja, generation_config.json} and printed
    "All tensors from the original checkpoint were written."
  - format_prompt_for_vllm_generation produces multi_modal_data keys
    ['audio','video'] with video=(N,H,W,3) ndarray and audio=(np_array,
    16000), with mm_processor_kwargs absent, and the rendered prompt
    contains both <|VIDEO|> and <|AUDIO|> placeholders.

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
VLMVerifyWorker previously collapsed the configured reward functions into
a single combined scalar via combine_reward_functions, so the validation
loop in nemo_rl/algorithms/grpo.py::validate could only log a single
"accuracy" number and the per-sample rewards JSONL stored only the
combined value. The grpo-intent guide promised "validation reward and
answer-correctness reward signal" in wandb / tensorboard but those
component metrics were never actually emitted.

Restructure so the per-function scores survive end to end:

- nemo_rl/environments/vlm_environment.py: factor reward construction
  into _build_named_reward_functions returning (name, fn, weight)
  triples. VLMVerifyWorker keeps a verify() shim for back-compat (still
  returning the combined scalar) and adds verify_with_components which
  returns both the combined list AND a per-sample list of weighted
  per-function scores in a stable order. VLMEnvironment.step now calls
  verify_with_components and returns rewards as an (N, K) tensor of
  weighted components -- summing along dim=1 reproduces the historical
  scalar total_reward GRPO uses for advantage computation, and the
  rollout's existing multi-reward path (run_multi_turn_rollout) already
  promotes the K columns to reward1, reward2, ... batch keys when K>1.
  Adds a Ray-callable reward_component_names() accessor so callers can
  map column index back to the configured name.

- nemo_rl/algorithms/grpo.py::validate: when reward<i> columns are
  present on the val batch, fetch the human names via the env's
  reward_component_names accessor, accumulate per-component sample
  values across val batches, compute means, surface them as
  reward/<name> entries in val_metrics, print them in the validation
  summary block, and write per-sample component values into the
  val_data_step{N}.jsonl artifact alongside the combined rewards.

For envs that still return a 1-D (N,) rewards tensor (math, retriever,
sliding_puzzle, code_jaccard, reward_model, nemo_gym, ...) nothing
changes: the rollout single-reward branch keeps total_reward as before
and the per-component plumbing stays inert.
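
The relationship between the (N, K) component table and the scalar reward can be illustrated with plain Python lists standing in for tensors. This is a sketch under that assumption; the helper names are hypothetical, and only the `reward1`, `reward2`, ... key pattern comes from the description above.

```python
def weighted_components(raw_scores, weights):
    """Scale each per-function score by its weight: (N, K) in, (N, K) out."""
    return [[w * s for w, s in zip(weights, row)] for row in raw_scores]


def total_reward(components):
    """Sum across the K columns, giving one scalar per sample."""
    return [sum(row) for row in components]


def reward_columns(components):
    """Split an (N, K) table into reward1..rewardK columns when K > 1."""
    k = len(components[0]) if components else 0
    if k <= 1:
        return {}
    return {f"reward{i + 1}": [row[i] for row in components] for i in range(k)}
```

Because the components are already weighted, summing a row gives the same scalar that a single combined reward function would have returned, so advantage computation is unaffected by the extra columns.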

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…metrics

Round 4 implemented the per-component reward emission in the VLM env +
GRPO validate (commit b49177f6). The grpo-intent guide's Results
section previously promised "validation reward and answer-correctness
reward signal" but did not enumerate the concrete metric keys; replace
that wording with the exact artifacts the validation loop now writes:

- Stdout block: Accuracy, Average response length, Samples processed,
  and a "Per-component reward (weighted):" sub-block listing one line
  per configured reward function (e.g. format, exact_alnum). Each
  listed value is weighted, so summing the components reproduces the
  combined reward.
- val_metrics handed off to wandb / tensorboard: accuracy, avg_length,
  and reward/<name> entries (e.g. reward/format, reward/exact_alnum).
- val_data_step{N}.jsonl: each row gets the existing content / idx /
  rewards columns plus one new column per reward component
  (reward/format, reward/exact_alnum) carrying per-sample weighted
  scores so eyeballing or plotting can split format vs answer-
  correctness signal directly.

Also documents the implementation seam so future readers know that
the (N, K) rewards tensor from the env flows through the rollout's
existing multi-reward path and that VLMEnvironment.reward_component_names()
is the Ray-callable accessor used by the validation loop.

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
…ootstrap GRPO

Initial training showed both validation rewards stuck at 0 because the
Qwen2.5-Omni-3B base emits a bare letter (e.g. "B") instead of
"<answer>B</answer>", so neither reward function ever fires:

  * format_reward only awards points for <think> and <answer> tags.
  * exact_alnum_reward extracts the content of <answer></answer> and
    returns 0 if the tag is missing entirely.

Two fixes:

1. Strengthen the per-problem-type instruction in IntentDataset so the
   prompt explicitly asks the model to first reason between <think> </think>
   tags and then commit the final answer between <answer> </answer> tags,
   with a concrete format example. Now the same prompt that previously
   said "Please provide only the single option letter ... within the
   <answer> </answer> tags." also says "First reason briefly between
   <think> </think> tags, then output ... <answer> </answer> tags. Format
   example: <think>your reasoning</think><answer>A</answer>".

2. Switch the YAML reward set from "format(0.2)+exact_alnum(0.8)" to
   "format(0.1)+exact_alnum_with_fallback(0.9)". The "with_fallback"
   variant treats the entire response as the answer when the
   <answer> </answer> tag is missing, so the model gets credit for
   bare-letter answers too. format_reward stays in the mix at low
   weight to nudge the policy toward the wrapped form so it eventually
   emits <think>/<answer> structure that GRPO can rely on.

Together these unblock GRPO from the all-zero-reward starting point.
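
The effect of the two reward sets on a bare-letter response can be sketched as follows. These are simplified stand-ins, not the NeMo-RL reward functions: the half-credit-per-tag-pair rule in `format_reward` and the lowercase alphanumeric normalization are assumptions made for illustration.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def _alnum(text: str) -> str:
    """Keep only alphanumeric characters, lowercased."""
    return "".join(ch for ch in text if ch.isalnum()).lower()


def exact_alnum(response: str, truth: str) -> float:
    """Strict: score 0 when the <answer> tag is missing."""
    match = ANSWER_RE.search(response)
    if match is None:
        return 0.0
    return float(_alnum(match.group(1)) == _alnum(truth))


def exact_alnum_with_fallback(response: str, truth: str) -> float:
    """Fallback: treat the whole response as the answer when the tag is missing."""
    match = ANSWER_RE.search(response)
    candidate = match.group(1) if match else response
    return float(_alnum(candidate) == _alnum(truth))


def format_reward(response: str) -> float:
    """Assumed scoring: half credit for each of the <think> and <answer> pairs."""
    return 0.5 * bool(THINK_RE.search(response)) + 0.5 * bool(ANSWER_RE.search(response))


def combined(response, truth, answer_fn, w_format, w_answer):
    """Weighted sum of the format reward and the chosen answer reward."""
    return w_format * format_reward(response) + w_answer * answer_fn(response, truth)
```

Under these assumptions a correct bare "B" scores 0.0 with format(0.2)+exact_alnum(0.8) but 0.9 with format(0.1)+exact_alnum_with_fallback(0.9), while a correctly wrapped response scores 1.0 under either set, so the wrapped form is still preferred.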

Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
… fixes

Daily-Omni eval (was video-only on an audio-visual benchmark):
- DailyOmniDataset.format_data now emits an independent {type:audio} item
  (16 kHz mono from the sibling *_audio.wav) alongside video so the
  Qwen2.5-Omni chat template renders <|AUDIO|> and vLLM populates
  multi_modal_data["audio"]; eval _format_for_eval locates the text item by
  type instead of a fixed index.
- register DailyOmniEvalDataConfig in the eval-config union: MasterConfig is
  now a pydantic BaseModel (upstream NVIDIA-NeMo#2325) and strictly validates
  data.dataset_name, which rejected 'daily-omni' even though the loader
  supported it.
- daily_omni.yaml: 32 video frames, audio:1 mm limit, max_model_len 32000,
  and gpu_memory_utilization 0.5 + max_num_seqs 8 to avoid the multimodal
  encoder activation OOM (vLLM batched ~66 clips into one encoder forward and
  hard-crashed the workers; KV cache was <2% used).
- daily_omni prompt switched to the training think+answer template.

Intent recipe:
- intent.py renders multiple-choice options into the prompt (_format_options);
  without them the model answered blind.
- 3B/7B megatron configs: full-throughput rollout batch sizes, per-forward
  batch of 1 to dodge the Qwen2.5-Omni get_rope_index IndexError, and 7B
  num_frames=8 to keep the training-forward activation memory in budget.
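
Rendering the options into the prompt could be as simple as the sketch below. The real `_format_options` in intent.py is not shown in this PR text, so the lettering scheme, the `Options:` header, and both function names here are assumptions.

```python
import string


def format_options(options: list[str]) -> str:
    """Label each option with a capital letter, one option per line."""
    return "\n".join(
        f"{letter}. {text}" for letter, text in zip(string.ascii_uppercase, options)
    )


def build_prompt(question: str, options: list[str]) -> str:
    """Append the lettered options to the question text."""
    return f"{question}\nOptions:\n{format_options(options)}"
```

Without a step like this the prompt contains only the question, and the model has no way to know what each answer letter refers to.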

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Reverts 5e9a818 (feat(grpo): expose per-component reward metrics in VLM
validation) and its companion docs 0fd3f48 (docs(grpo-intent): match
Results section to per-component validation metrics).

5e9a818 restructured VLMEnvironment to emit an (N,K) per-component reward
tensor and added validation-loop logging of reward/<name> components. This
was purely added observability — training advantage still used the combined
scalar — so reverting changes no training/eval behaviour, only drops the
extra per-component val logs. grpo.py returns to its origin/main state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Remove the exact_answer_alphanumeric_with_fallback_reward function, its
VLM-environment registration, and switch the Daily-Omni eval config to the
strict exact_alnum reward. With the think+answer prompt the model reliably
wraps its answer in <answer> tags, so the no-tag fallback never fires;
recomputed strict-vs-fallback scores on the saved decodes are identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
- intent_grpo_7B_megatron.yaml now inherits grpo_math_1B_megatron.yaml
  directly (the same base the 3B recipe used) and inlines the intent-specific
  config, so it no longer depends on the 3B recipe. Resolved config is
  unchanged except checkpoint_dir, which is corrected from the inherited
  results/intent_grpo_3B_megatron to results/intent_grpo_7B_megatron.
- untrack examples/configs/intent_grpo_3B_megatron.yaml (kept on disk).
- rename docs/guides/grpo-intent.md -> grpo-audio-visual.md, update docs/index.md
  links, and rewrite the guide for the 7B recipe (8 frames, TP=2, batch 32/1,
  logprob_batch_size=1, save_period 20). Replace the stale smoke section with
  the real Daily-Omni eval flow and a base-vs-After-GRPO results table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
- tests/functional/audio_visual_grpo_megatron.sh: 2-step intent audio+video
  GRPO on the 7B recipe pinned to the lighter Qwen2.5-Omni-3B, asserting
  max(train/reward) > 0.6 and mean(train/token_mult_prob_error) < 1.05.
  Registered in L1_Functional_Tests_Megatron_1.sh (full mode only, not the
  fast lane, since it pulls the IntentTrain video dataset).
- drop the ffmpeg-dependent intent-dataset unit test; fabricating an mp4 with
  an audio track needs ffmpeg, which the unit suite should not require. The
  audio+video sample-shape contract is covered by the functional test above
  and the vLLM-utils unit tests.
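
The two thresholds the functional test asserts can be expressed as a small check over logged metrics. This is a sketch only: it assumes the metrics are available as a dict mapping each metric name to a `{step: value}` dict, and `check_run` is a hypothetical helper rather than the script's actual code.

```python
from statistics import mean


def check_run(metrics: dict) -> list[str]:
    """Return a list of failed checks; an empty list means the run passed."""
    failures = []
    rewards = metrics["train/reward"].values()
    errors = metrics["train/token_mult_prob_error"].values()
    if not max(rewards) > 0.6:
        failures.append("max(train/reward) <= 0.6")
    if not mean(errors) < 1.05:
        failures.append("mean(train/token_mult_prob_error) >= 1.05")
    return failures
```

Returning the failed checks instead of asserting immediately lets a caller report both problems at once when a run regresses on reward and on the probability-error metric together.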

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yuekaizhang yuekaizhang requested review from a team as code owners June 15, 2026 07:48
copy-pr-bot Bot commented Jun 15, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the "Documentation" label (Improvements or additions to documentation) Jun 15, 2026
…2 paper

- title -> "Audio-Visual GRPO with Qwen2.5-Omni-7B"
- intro + index card: evaluate on Daily-Omni (was PhilipC/IntentBench);
  index card model 3B -> 7B
- link HumanOmniV2 to the paper (arxiv 2506.21277) instead of the GitHub repo
- drop the use_audio_in_video notes from the intro, the 7B training-notes
  rope-IndexError bullet, and the AV-Event-Alignment sentence in Results

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yuekaizhang

Contributor Author

@yuki-97 Hi Yuki, could you review the PR when you have a moment? Many thanks!

@yuekaizhang

Contributor Author

/ok to test d9ca267

@yuekaizhang yuekaizhang added the "CI:L1" label (Run doctests, unit tests, and functional tests) Jun 16, 2026
ruff-format adds the missing second blank line after _format_options
(pre-commit ruff-format hook); no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yuekaizhang

Contributor Author

/ok to test cc150ab

The Daily-Omni eval dataset module is type-clean, so the pyrefly hook
requires it on the project-includes whitelist.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yuekaizhang yuekaizhang requested a review from a team as a code owner June 16, 2026 04:53
@yuekaizhang

Contributor Author

/ok to test 2871d57

DailyOmniDataset.format_data now emits [video, audio, text] content items
(audio added for the audio-visual recipe), so the content assertions check
content[1]==audio and content[2]==text.
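
Since the position of the text item shifts whenever a modality is added, looking items up by type is more robust than indexing. A minimal sketch, assuming each content item is a dict with a `type` key; `find_item` is a hypothetical helper and not the code in this PR.

```python
def find_item(content: list[dict], item_type: str) -> dict:
    """Return the single content item of the given type.

    Raises ValueError if there is no such item or more than one.
    """
    matches = [item for item in content if item.get("type") == item_type]
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one {item_type!r} item, found {len(matches)}"
        )
    return matches[0]
```

The same lookup returns the text item for both the old [video, text] layout and the new [video, audio, text] layout, so callers do not need updating when the order changes.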

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
@yuekaizhang

Contributor Author

/ok to test e1aa32c

