NVIDIA-NeMo · yuekaizhang · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
@@ -0,0 +1,86 @@
+# Audio-Visual GRPO with Qwen2.5-Omni-7B
+
+This guide explains how to use NeMo RL to train [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) with GRPO on the [PhilipC/IntentTrain](https://huggingface.co/datasets/PhilipC/IntentTrain) audio-visual intent-recognition dataset and evaluate on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni), following the dataset structure used in [HumanOmniV2](https://arxiv.org/abs/2506.21277).
+
+Each training sample feeds the Qwen2.5-Omni processor both the video stream (8 frames) and the audio track decoded from the same file at 16 kHz mono. Audio and video flow as two **independent multimodal items** per prompt: the dataset emits `{type: video}` + `{type: audio}` content items, the Qwen2.5-Omni chat template renders both `<|VIDEO|>` and `<|AUDIO|>` placeholders, and vLLM rollouts populate `multi_modal_data["video"]` and `multi_modal_data["audio"]` from the same sample.
+
+## 1. Train the Model
+
+Run GRPO training with the provided config:
+
+```
+uv run examples/run_vlm_grpo.py --config examples/configs/intent_grpo_7B_megatron.yaml
+```
+
+Config: `examples/configs/intent_grpo_7B_megatron.yaml`
+
+Key hyperparameters:
+
+| Parameter | Value |
+| --- | --- |
+| Model | Qwen2.5-Omni-7B |
+| Train dataset | PhilipC/IntentTrain (problem_type = "multiple choice") |
+| Validation dataset | PhilipC/IntentBench (problem_type = "multiple choice") |
+| Modalities per prompt | video (8 frames, `<\|VIDEO\|>` placeholder) + audio (16 kHz mono, `<\|AUDIO\|>` placeholder) — independent multimodal items, no `use_audio_in_video` alignment |
+| GPUs | 8 x 1 node, Megatron backend, `tensor_model_parallel_size=2` (data parallel = 4) |
+| Learning rate | 1e-6 |
+| KL penalty | 0.01 |
+| Generations per prompt | 8 |
+| Prompts per step | 32 |
+| Train global / micro batch | 32 / 1 |
+| Max steps | 1000 |
+| Save period | 20 |
+| Reward | format (0.2) + exact_alnum (0.8) |
+
+The dataset class downloads `PhilipC/IntentTrain` and `PhilipC/IntentBench` via `huggingface_hub.snapshot_download` and extracts each `videos.zip` once into the corresponding HuggingFace cache directory. Re-instantiating the dataset on a machine that already has the archives extracted is a no-op.
+
+Only `problem_type == "multiple choice"` samples are used. The allow-list is configurable through `data.train.allowed_problem_types` and `data.validation.allowed_problem_types` if you want to extend scope (for example, to `emer_ov_mc`); doing so requires picking an answer-correctness reward that handles those answer formats.
+
+### 7B training notes
+
+- **8 video frames** keep the prompt around ~4.5k tokens (8×360 video + ~1.5k audio + text), under `max_total_sequence_length=8192`, and roughly halve the training-forward activation memory versus 16 frames. Do **not** switch to fps-based sampling — at fps=2 the clips expand to ~43k video tokens, blow past the token budget, and `vlm_hf_data_processor` then empties the multimodal items and sets `loss_multiplier=0`.
+- **`activation_checkpointing: true` + `gpu_memory_utilization: 0.4`** keep the Megatron forward inside the memory vLLM leaves resident after sleep mode. If `tensor_model_parallel_size=2` OOMs, fall back to `tensor_model_parallel_size=4` (proven to run at 8 frames).
+- If `loss_multiplier` is logged at 0 for many samples, the multimodal prompt is exceeding `max_total_sequence_length`; bump it until validation samples consistently produce non-zero loss.
+- Set `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1` once `Qwen/Qwen2.5-Omni-7B`, `PhilipC/IntentTrain`, and `PhilipC/IntentBench` are pre-fetched, so Megatron's tokenizer worker doesn't hit the network.
+
+## 2. Convert Checkpoint (Megatron to HF)
+
+Checkpoints are saved under `results/intent_grpo_7B_megatron` (`checkpointing.checkpoint_dir`), one every `save_period=20` steps. Convert a checkpoint from Megatron to Hugging Face format before evaluating:
+
+```
+uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
+    --config results/intent_grpo_7B_megatron/step_43/config.yaml \
+    --megatron-ckpt-path results/intent_grpo_7B_megatron/step_43/policy/weights/iter_0000000 \
+    --hf-ckpt-path results/intent_grpo_7B_megatron/step_43/hf --no-strict
+```
+
+Replace the step number with the checkpoint you want to evaluate. `--no-strict` is expected here: only the Qwen2.5-Omni *thinker* is trained, so the talker tensors are reported as "not written". The `--extra mcore` flag is required for the Megatron converter.
+
+## 3. Evaluate
+
+In-training validation uses IntentBench as the validation set, so `val_period`, `val_batch_size`, and `max_val_samples` from the config drive evaluation cadence.
+
+For a standalone benchmark, decode the converted HF checkpoint on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni) (1197 audio-visual multiple-choice questions) with `examples/run_eval.py`:
+
+```
+uv run examples/run_eval.py --config examples/configs/evals/daily_omni.yaml \
+    generation.model_name=results/intent_grpo_7B_megatron/step_43/hf
+```
+
+The eval config (`examples/configs/evals/daily_omni.yaml`) feeds audio + video (32 frames — eval has no training-forward memory pressure, so it samples more densely than training), uses the same think+answer prompt as training, and scores with `exact_alnum` (case-insensitive exact match on the `<answer>` content).
+
+## 4. Results
+
+Daily-Omni accuracy (1197 questions, greedy decoding) for the base Qwen2.5-Omni-7B versus the GRPO-trained checkpoint:
+
+| Question type | Base | After GRPO |
+| --- | --- | --- |
+| **Overall** | **0.498** | **0.590** |
+| AV Event Alignment | 0.353 | 0.450 |
+| Comparative | 0.618 | 0.725 |
+| Context understanding | 0.446 | 0.534 |
+| Event Sequence | 0.395 | 0.490 |
+| Inference | 0.714 | 0.760 |
+| Reasoning | 0.651 | 0.766 |
+
+GRPO lifts overall Daily-Omni accuracy by ~9 points, with gains across every question category. The largest relative gains are on the reasoning-style questions.
@@ -121,6 +121,13 @@ Configure offline and online Eagle3 draft-model workflows to accelerate rollout
 Train Qwen2.5-Omni-3B with GRPO on AVQA and evaluate on MMAU, following the R1-AQA approach.
 :::
 
+:::{grid-item-card} {octicon}`device-camera-video` Audio+Video Intent GRPO
+:link: guides/grpo-audio-visual
+:link-type: doc
+
+Train Qwen2.5-Omni-7B with GRPO on PhilipC/IntentTrain (audio-visual intent recognition) and evaluate on Daily-Omni, following HumanOmniV2's joint audio+video setup.
+:::
+
 :::{grid-item-card} {octicon}`plus-circle` Adding New Models
 :link: adding-new-models
 :link-type: doc
@@ -259,6 +266,7 @@ guides/ppo.md
 guides/grpo-deepscaler.md
 guides/grpo-sliding-puzzle.md
 guides/grpo-audio.md
+guides/grpo-audio-visual.md
 guides/rm.md
 guides/environments.md
 guides/eval.md

@@ -0,0 +1,82 @@
+eval:
+  metric: "pass@k"
+  num_tests_per_prompt: 1
+  seed: 42
+  k_value: 1
+  save_path: results/daily_omni_decode.json
+
+generation:
+  backend: "vllm"
+  max_new_tokens: 2048
+  temperature: 0.0
+  top_p: 1.0
+  top_k: -1
+  num_prompts_per_step: -1
+  model_name: "Qwen/Qwen2.5-Omni-3B"
+  stop_token_ids: null
+  stop_strings: null
+  vllm_cfg:
+    async_engine: false
+    precision: "bfloat16"
+    tensor_parallel_size: 1
+    pipeline_parallel_size: 1
+    expert_parallel_size: 1
+    # 0.9 -> 0.5: with 32 video frames + audio, the Qwen2.5-Omni vision/audio
+    # encoder forward needs a large chunk of *transient activation* memory that
+    # lives outside vLLM's KV-cache budget. At 0.9 the KV cache claims almost
+    # all VRAM (56+ GiB) and the first multimodal forward OOM-crashes the vLLM
+    # workers (hard EOF, no graceful torch OOM). 0.5 leaves ample headroom; KV
+    # cache is still ~1M tokens, far more than eval needs.
+    gpu_memory_utilization: 0.5
+    # Bumped from 16000 to fit 32 video frames + the 16 kHz audio track
+    # without truncating the multimodal prompt (truncation silently masks
+    # samples out and collapses their reward to 0).
+    max_model_len: 32000
+    enforce_eager: False
+    skip_tokenizer_init: False
+    limit_mm_per_prompt:
+      video: 1
+      audio: 1
+  vllm_kwargs:
+    # Disable mm processor cache to avoid vLLM cache eviction during eval.
+    mm_processor_cache_gb: 0
+    # Cap concurrent sequences so the Qwen2.5-Omni vision/audio encoder only
+    # processes a few clips per step. With audio + 32 video frames, vLLM
+    # otherwise batches ~66 clips into one encoder forward and OOM-crashes the
+    # workers (kv_cache_usage was ~2% at crash -> it is encoder *activation*
+    # memory, not KV cache). 8 keeps the encoder batch small; eval throughput
+    # is not a concern.
+    max_num_seqs: 8
+  colocated:
+    enabled: true
+    resources:
+      gpus_per_node: null
+      num_nodes: null
+
+tokenizer:
+  name: ${generation.model_name}
+  chat_template: "default"
+  chat_template_kwargs: null
+  video:
+    # 16 -> 32 frames: 60s clips at 16 frames is ~1 frame / 3.75s, too sparse
+    # for fine-grained temporal (Event Sequence) questions.
+    num_frames: 32
+
+data:
+  max_input_seq_length: ${generation.vllm_cfg.max_model_len}
+  prompt_file: examples/prompts/daily_omni.txt
+  system_prompt_file: null
+  dataset_name: "daily-omni"
+  split: "train"
+  env_name: vlm
+
+env:
+  vlm:
+    num_workers: 8
+    reward_functions:
+    - name: exact_alnum
+      weight: 1.0
+
+cluster:
+  gpus_per_node: 1
+  num_nodes: 1
@@ -0,0 +1,168 @@
+# Intent (audio+video) GRPO 7B Megatron configuration.
+#
+# Trains Qwen/Qwen2.5-Omni-7B with GRPO on PhilipC/IntentTrain (intent
+# recognition over short MER24 / social_iq video clips with audio) and runs
+# in-training validation on PhilipC/IntentBench.
+#   * Audio and video reach the model as two independent multimodal items
+#     per prompt: the dataset emits {type: video} + {type: audio}, the chat
+#     template renders <|VIDEO|> and <|AUDIO|> placeholders, and vLLM
+#     rollouts pass them as multi_modal_data["video"] / multi_modal_data["audio"].
+#     use_audio_in_video=True / mm_processor_kwargs are NOT used because the
+#     installed transformers + vLLM Qwen2.5-Omni stack rejected that path.
+#   * Only problem_type == "multiple choice" samples are used; rewards reuse
+#     the audio recipe's format + exact_alnum.
+#
+# 7B requires more aggressive sharding than 3B to fit on 80 GB H100s alongside
+# vLLM rollout memory:
+#   * tensor_model_parallel_size: 2 -> model state sharded across 2 ranks,
+#     data parallel size = gpus_per_node / TP = 4 with 8 GPUs.
+#   * per-forward batch must be exactly 1 sample/rank (train_micro_batch_size=1,
+#     logprob_batch_size=1), else the Qwen2.5-Omni get_rope_index path crashes
+#     with "IndexError: index 1 is out of bounds for dimension 0 with size 1".
+#   * num_frames 8 (vs the 3B recipe's 16) to roughly halve the prompt length
+#     and the training-forward activation memory.
+#   * activation_checkpointing on, vllm gpu_memory_utilization 0.4 to leave
+#     headroom for the Megatron forward.
+#
+# Inherits directly from grpo_math_1B_megatron.yaml (the same base the 3B
+# recipe uses) and overrides intent-specific + 7B-specific settings.
+defaults: "grpo_math_1B_megatron.yaml"
+
+grpo:
+  num_prompts_per_step: 32
+  num_generations_per_prompt: 8
+  max_num_steps: 1000
+  val_at_start: false
+  max_val_samples: 256
+  val_batch_size: 32
+
+checkpointing:
+  enabled: true
+  checkpoint_dir: results/intent_grpo_7B_megatron
+  keep_top_k: 10
+  # save_period 20: a 1-epoch (~85-step) 7B run is slow (~6 min/step) and
+  # previously hit the Slurm time limit at ~step 30 with checkpoints/ still
+  # EMPTY. 20 lands a checkpoint at steps 20/40/60/80. checkpoint_must_save_by
+  # additionally forces a save once 3h45m of wall-clock have elapsed so
+  # progress survives the job time limit (format DD:HH:MM:SS).
+  save_period: 20
+  checkpoint_must_save_by: "00:03:45:00"
+
+policy:
+  model_name: Qwen/Qwen2.5-Omni-7B
+  # PER-FORWARD batch must be exactly 1 sample/rank, else the Qwen2.5-Omni
+  # get_rope_index path crashes with "IndexError: index 1 is out of bounds for
+  # dimension 0 with size 1" (input_ids batch > attention_mask batch). That is
+  # controlled by train_micro_batch_size=1 (train forward) and
+  # logprob_batch_size=1 (log-prob forward). train_global_batch_size=32 only
+  # sets gradient accumulation and must stay divisible by micro x DP
+  # (32 % (1 x DP=4) == 0).
+  train_global_batch_size: 32
+  train_micro_batch_size: 1
+  generation_batch_size: 32
+  logprob_batch_size: 1
+  # Audio + video produces materially more tokens than the audio-only recipe;
+  # this budget keeps loss_multiplier > 0 with headroom. The video frame count
+  # (tokenizer.video.num_frames) is the dominant lever on prompt length -- do
+  # not raise it (or switch to fps) without raising this too.
+  max_total_sequence_length: 8192
+
+  tokenizer:
+    video:
+      # 7B: 8 frames (vs the 3B recipe's 16) to roughly halve the prompt length
+      # (~7.3k -> ~4.5k tokens: 8x360 video + ~1.5k audio + text) and thus the
+      # training-forward activation memory. NOTE: stopgap -- the proper fix
+      # (matching HumanOmniV2, which only trains the LLM) is to FREEZE the
+      # vision/audio encoders, which needs a code hook (no YAML knob exists).
+      # DO NOT switch to fps-based sampling: fps=2 expands the clips to ~43k
+      # video tokens, blows past max_total_sequence_length / vLLM max_model_len,
+      # and vlm_hf_data_processor then empties the multimodal items
+      # (loss_multiplier=0). fps and num_frames are mutually exclusive.
+      num_frames: 8
+
+  sequence_packing:
+    enabled: false
+
+  generation:
+    max_new_tokens: 1024
+    vllm_cfg:
+      # Audio/multimodal models require tokenizer to be initialized before generation
+      skip_tokenizer_init: False
+      # 7B model state crowds the GPU; lower vLLM cache budget so Megatron has
+      # room for activations during the training-time forward pass.
+      gpu_memory_utilization: 0.4
+      limit_mm_per_prompt:
+        video: 1
+        audio: 1
+    vllm_kwargs:
+      # Disable mm processor cache to avoid vLLM cache eviction assertion error during validation.
+      mm_processor_cache_gb: 0
+
+  megatron_cfg:
+    converter_type: Qwen2_5OmniForConditionalGeneration
+    apply_rope_fusion: false
+    activation_checkpointing: true
+    # TP=2 (DP=4 on 8 GPUs) -- 2x the data-parallel throughput of TP=4. Valid
+    # TP values are 1/2/4 (num_attention_heads=28 must be divisible by TP; TP=8
+    # fails). At num_frames=8 (~4.5k-token sequence) the logits/activation
+    # memory is ~40% smaller than at 16 frames, so TP=2 fits. If it OOMs, fall
+    # back to tensor_model_parallel_size=4 (proven to run at 8 frames).
+    tensor_model_parallel_size: 2
+    optimizer:
+      lr: 1.0e-6
+      min_lr: 1.0e-7
+    scheduler:
+      lr_warmup_iters: 10
+      lr_warmup_init: 1.0e-7
+    distributed_data_parallel_config:
+      overlap_grad_reduce: false
+
+data:
+  num_workers: 0
+  train:
+    dataset_name: intent-train
+    split: train
+    allowed_problem_types:
+      - "multiple choice"
+  validation:
+    dataset_name: intent-bench
+    split: validation
+    allowed_problem_types:
+      - "multiple choice"
+  default:
+    prompt_file: null
+    system_prompt_file: null
+    processor: "vlm_hf_data_processor"
+    env_name: "vlm"
+
+env:
+  vlm:
+    num_workers: 8
+    # Strict two-signal reward (format + accuracy), same structure as the
+    # HumanOmniV2 reference. The IntentDataset prompt instructs the model to
+    # reason between <think> </think> and commit the answer between
+    # <answer> </answer> tags:
+    #   * format     -- rewards the <think>...</think><answer>...</answer>
+    #                   structure (does not gate correctness).
+    #   * exact_alnum -- case-insensitive exact match on the <answer> content;
+    #                   returns 0 when the <answer> tag is missing, so the model
+    #                   must emit the wrapped form to earn the accuracy signal.
+    reward_functions:
+    - name: format
+      weight: 0.2
+    - name: exact_alnum
+      weight: 0.8
+
+logger:
+  wandb_enabled: true
+  tensorboard_enabled: true
+  monitor_gpus: false
+  wandb:
+    project: grpo-dev
+    name: intent-grpo-7b-megatron
+  swanlab:
+    project: grpo-dev
+    name: intent-grpo-7b-megatron
+
+cluster:
+  gpus_per_node: 8
@@ -0,0 +1 @@
+{} First reason briefly between <think> </think> tags, then output only the single option letter (e.g., A, B, C, D, ...) between <answer> </answer> tags. Format example: <think>your reasoning</think><answer>A</answer>
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1 @@
		{} First reason briefly between <think> </think> tags, then output only the single option letter (e.g., A, B, C, D, ...) between <answer> </answer> tags. Format example: <think>your reasoning</think><answer>A</answer>