Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
ddc76a7
feat(eval): support Daily-Omni + Qwen2.5-Omni eval
yuekaizhang Jun 10, 2026
74b2a69
feat(grpo): add audio+video Intent GRPO recipe for Qwen2.5-Omni-3B
yuekaizhang Jun 10, 2026
6bfa8d5
fix(grpo-intent): use two-step processor call for audio+video samples
yuekaizhang Jun 10, 2026
25add99
fix(grpo-intent): pass prompt_token_ids to vLLM for audio+video samples
yuekaizhang Jun 10, 2026
57e0e53
fix(grpo-intent): pass audio + video as independent streams (no use_a…
yuekaizhang Jun 10, 2026
263d506
docs(grpo-intent): align comments + tests with verified independent-s…
yuekaizhang Jun 10, 2026
e2aeba9
docs(grpo-intent): match guide to verified independent-streams smoke run
yuekaizhang Jun 10, 2026
5e9a818
feat(grpo): expose per-component reward metrics in VLM validation
yuekaizhang Jun 10, 2026
0fd3f48
docs(grpo-intent): match Results section to per-component validation …
yuekaizhang Jun 10, 2026
e3dbc5e
fix(grpo-intent): explicit think+answer prompt + fallback reward to b…
yuekaizhang Jun 10, 2026
6fb6c00
feat(grpo-intent): audio+video Daily-Omni eval + intent prompt/config…
yuekaizhang Jun 15, 2026
2530250
revert: drop per-component VLM validation reward logging
yuekaizhang Jun 15, 2026
099ec14
refactor: drop exact_alnum_with_fallback reward
yuekaizhang Jun 15, 2026
37c2c11
refactor(grpo-audio-visual): standalone 7B recipe + 7B guide
yuekaizhang Jun 15, 2026
55bcf7f
test: add audio-visual GRPO megatron L1 functional test
yuekaizhang Jun 15, 2026
45ece43
docs(grpo-audio-visual): retitle, eval on Daily-Omni, link HumanOmniV…
yuekaizhang Jun 15, 2026
d9ca267
Merge branch 'main' into audio_video
yuekaizhang Jun 16, 2026
cc150ab
chore: apply ruff-format to intent dataset
yuekaizhang Jun 16, 2026
2871d57
chore: add eval_datasets/daily_omni.py to pyrefly project-includes
yuekaizhang Jun 16, 2026
e1aa32c
test: update test_dailyomni_dataset for audio+video content shape
yuekaizhang Jun 16, 2026
b5fd215
Merge branch 'main' into audio_video
yuekaizhang Jun 16, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 86 additions & 0 deletions docs/guides/grpo-audio-visual.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Audio-Visual GRPO with Qwen2.5-Omni-7B

This guide explains how to use NeMo RL to train [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) with GRPO on the [PhilipC/IntentTrain](https://huggingface.co/datasets/PhilipC/IntentTrain) audio-visual intent-recognition dataset and evaluate on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni), following the dataset structure used in [HumanOmniV2](https://arxiv.org/abs/2506.21277).

Each training sample feeds the Qwen2.5-Omni processor both the video stream (8 frames) and the audio track decoded from the same file at 16 kHz mono. Audio and video flow as two **independent multimodal items** per prompt: the dataset emits `{type: video}` + `{type: audio}` content items, the Qwen2.5-Omni chat template renders both `<|VIDEO|>` and `<|AUDIO|>` placeholders, and vLLM rollouts populate `multi_modal_data["video"]` and `multi_modal_data["audio"]` from the same sample.

## 1. Train the Model

Run GRPO training with the provided config:

```
uv run examples/run_vlm_grpo.py --config examples/configs/intent_grpo_7B_megatron.yaml
```

Config: `examples/configs/intent_grpo_7B_megatron.yaml`

Key hyperparameters:

| Parameter | Value |
| --- | --- |
| Model | Qwen2.5-Omni-7B |
| Train dataset | PhilipC/IntentTrain (problem_type = "multiple choice") |
| Validation dataset | PhilipC/IntentBench (problem_type = "multiple choice") |
| Modalities per prompt | video (8 frames, `<\|VIDEO\|>` placeholder) + audio (16 kHz mono, `<\|AUDIO\|>` placeholder) — independent multimodal items, no `use_audio_in_video` alignment |
| GPUs | 8 x 1 node, Megatron backend, `tensor_model_parallel_size=2` (data parallel = 4) |
| Learning rate | 1e-6 |
| KL penalty | 0.01 |
| Generations per prompt | 8 |
| Prompts per step | 32 |
| Train global / micro batch | 32 / 1 |
| Max steps | 1000 |
| Save period | 20 |
| Reward | format (0.2) + exact_alnum (0.8) |

The dataset class downloads `PhilipC/IntentTrain` and `PhilipC/IntentBench` via `huggingface_hub.snapshot_download` and extracts each `videos.zip` once into the corresponding HuggingFace cache directory. Re-instantiating the dataset on a machine that already has the archives extracted is a no-op.

Only `problem_type == "multiple choice"` samples are used. The allow-list is configurable through `data.train.allowed_problem_types` and `data.validation.allowed_problem_types` if you want to extend scope (for example, to `emer_ov_mc`); doing so requires picking an answer-correctness reward that handles those answer formats.

### 7B training notes

- **8 video frames** keep the prompt around ~4.5k tokens (8×360 video + ~1.5k audio + text), under `max_total_sequence_length=8192`, and roughly halve the training-forward activation memory versus 16 frames. Do **not** switch to fps-based sampling — at fps=2 the clips expand to ~43k video tokens, blow past the token budget, and `vlm_hf_data_processor` then empties the multimodal items and sets `loss_multiplier=0`.
- **`activation_checkpointing: true` + `gpu_memory_utilization: 0.4`** keep the Megatron forward inside the memory vLLM leaves resident after sleep mode. If `tensor_model_parallel_size=2` OOMs, fall back to `tensor_model_parallel_size=4` (proven to run at 8 frames).
- If `loss_multiplier` is logged at 0 for many samples, the multimodal prompt is exceeding `max_total_sequence_length`; bump it until validation samples consistently produce non-zero loss.
- Set `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1` once `Qwen/Qwen2.5-Omni-7B`, `PhilipC/IntentTrain`, and `PhilipC/IntentBench` are pre-fetched, so Megatron's tokenizer worker doesn't hit the network.

## 2. Convert Checkpoint (Megatron to HF)

Checkpoints are saved under `results/intent_grpo_7B_megatron` (`checkpointing.checkpoint_dir`), one every `save_period=20` steps. Convert a checkpoint from Megatron to Hugging Face format before evaluating:

```
uv run --extra mcore python examples/converters/convert_megatron_to_hf.py \
--config results/intent_grpo_7B_megatron/step_43/config.yaml \
--megatron-ckpt-path results/intent_grpo_7B_megatron/step_43/policy/weights/iter_0000000 \
--hf-ckpt-path results/intent_grpo_7B_megatron/step_43/hf --no-strict
```

Replace the step number with the checkpoint you want to evaluate. `--no-strict` is expected here: only the Qwen2.5-Omni *thinker* is trained, so the talker tensors are reported as "not written". The `--extra mcore` flag is required for the Megatron converter.

## 3. Evaluate

In-training validation uses IntentBench as the validation set, so `val_period`, `val_batch_size`, and `max_val_samples` from the config drive evaluation cadence.

For a standalone benchmark, decode the converted HF checkpoint on [Daily-Omni](https://huggingface.co/datasets/liarliar/Daily-Omni) (1197 audio-visual multiple-choice questions) with `examples/run_eval.py`:

```
uv run examples/run_eval.py --config examples/configs/evals/daily_omni.yaml \
generation.model_name=results/intent_grpo_7B_megatron/step_43/hf
```

The eval config (`examples/configs/evals/daily_omni.yaml`) feeds audio + video (32 frames — eval has no training-forward memory pressure, so it samples more densely than training), uses the same think+answer prompt as training, and scores with `exact_alnum` (case-insensitive exact match on the `<answer>` content).

## 4. Results

Daily-Omni accuracy (1197 questions, greedy decoding) for the base Qwen2.5-Omni-7B versus the GRPO-trained checkpoint:

| Question type | Base | After GRPO |
| --- | --- | --- |
| **Overall** | **0.498** | **0.590** |
| AV Event Alignment | 0.353 | 0.450 |
| Comparative | 0.618 | 0.725 |
| Context understanding | 0.446 | 0.534 |
| Event Sequence | 0.395 | 0.490 |
| Inference | 0.714 | 0.760 |
| Reasoning | 0.651 | 0.766 |

GRPO lifts overall Daily-Omni accuracy by ~9 points, with gains across every question category. The largest relative gains are on the reasoning-style questions.
8 changes: 8 additions & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,13 @@ Configure offline and online Eagle3 draft-model workflows to accelerate rollout
Train Qwen2.5-Omni-3B with GRPO on AVQA and evaluate on MMAU, following the R1-AQA approach.
:::

:::{grid-item-card} {octicon}`device-camera-video` Audio+Video Intent GRPO
:link: guides/grpo-audio-visual
:link-type: doc

Train Qwen2.5-Omni-7B with GRPO on PhilipC/IntentTrain (audio-visual intent recognition) and evaluate on Daily-Omni, following HumanOmniV2's joint audio+video setup.
:::

:::{grid-item-card} {octicon}`plus-circle` Adding New Models
:link: adding-new-models
:link-type: doc
Expand Down Expand Up @@ -259,6 +266,7 @@ guides/ppo.md
guides/grpo-deepscaler.md
guides/grpo-sliding-puzzle.md
guides/grpo-audio.md
guides/grpo-audio-visual.md
guides/rm.md
guides/environments.md
guides/eval.md
Expand Down
82 changes: 82 additions & 0 deletions examples/configs/evals/daily_omni.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,82 @@
eval:
metric: "pass@k"
num_tests_per_prompt: 1
seed: 42
k_value: 1
save_path: results/daily_omni_decode.json

generation:
backend: "vllm"
max_new_tokens: 2048
temperature: 0.0
top_p: 1.0
top_k: -1
num_prompts_per_step: -1
model_name: "Qwen/Qwen2.5-Omni-3B"
stop_token_ids: null
stop_strings: null
vllm_cfg:
async_engine: false
precision: "bfloat16"
tensor_parallel_size: 1
pipeline_parallel_size: 1
expert_parallel_size: 1
# 0.9 -> 0.5: with 32 video frames + audio, the Qwen2.5-Omni vision/audio
# encoder forward needs a large chunk of *transient activation* memory that
# lives outside vLLM's KV-cache budget. At 0.9 the KV cache claims almost
# all VRAM (56+ GiB) and the first multimodal forward OOM-crashes the vLLM
# workers (hard EOF, no graceful torch OOM). 0.5 leaves ample headroom; KV
# cache is still ~1M tokens, far more than eval needs.
gpu_memory_utilization: 0.5
# Bumped from 16000 to fit 32 video frames + the 16 kHz audio track
# without truncating the multimodal prompt (truncation silently masks
# samples out and collapses their reward to 0).
max_model_len: 32000
enforce_eager: False
skip_tokenizer_init: False
limit_mm_per_prompt:
video: 1
audio: 1
vllm_kwargs:
# Disable mm processor cache to avoid vLLM cache eviction during eval.
mm_processor_cache_gb: 0
# Cap concurrent sequences so the Qwen2.5-Omni vision/audio encoder only
# processes a few clips per step. With audio + 32 video frames, vLLM
# otherwise batches ~66 clips into one encoder forward and OOM-crashes the
# workers (kv_cache_usage was ~2% at crash -> it is encoder *activation*
# memory, not KV cache). 8 keeps the encoder batch small; eval throughput
# is not a concern.
max_num_seqs: 8
colocated:
enabled: true
resources:
gpus_per_node: null
num_nodes: null

tokenizer:
name: ${generation.model_name}
chat_template: "default"
chat_template_kwargs: null
video:
# 16 -> 32 frames: 60s clips at 16 frames is ~1 frame / 3.75s, too sparse
# for fine-grained temporal (Event Sequence) questions.
num_frames: 32

data:
max_input_seq_length: ${generation.vllm_cfg.max_model_len}
prompt_file: examples/prompts/daily_omni.txt
system_prompt_file: null
dataset_name: "daily-omni"
split: "train"
env_name: vlm

env:
vlm:
num_workers: 8
reward_functions:
- name: exact_alnum
weight: 1.0

cluster:
gpus_per_node: 1
num_nodes: 1
168 changes: 168 additions & 0 deletions examples/configs/intent_grpo_7B_megatron.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,168 @@
# Intent (audio+video) GRPO 7B Megatron configuration.
#
# Trains Qwen/Qwen2.5-Omni-7B with GRPO on PhilipC/IntentTrain (intent
# recognition over short MER24 / social_iq video clips with audio) and runs
# in-training validation on PhilipC/IntentBench.
# * Audio and video reach the model as two independent multimodal items
# per prompt: the dataset emits {type: video} + {type: audio}, the chat
# template renders <|VIDEO|> and <|AUDIO|> placeholders, and vLLM
# rollouts pass them as multi_modal_data["video"] / multi_modal_data["audio"].
# use_audio_in_video=True / mm_processor_kwargs are NOT used because the
# installed transformers + vLLM Qwen2.5-Omni stack rejected that path.
# * Only problem_type == "multiple choice" samples are used; rewards reuse
# the audio recipe's format + exact_alnum.
#
# 7B requires more aggressive sharding than 3B to fit on 80 GB H100s alongside
# vLLM rollout memory:
# * tensor_model_parallel_size: 2 -> model state sharded across 2 ranks,
# data parallel size = gpus_per_node / TP = 4 with 8 GPUs.
# * per-forward batch must be exactly 1 sample/rank (train_micro_batch_size=1,
# logprob_batch_size=1), else the Qwen2.5-Omni get_rope_index path crashes
# with "IndexError: index 1 is out of bounds for dimension 0 with size 1".
# * num_frames 8 (vs the 3B recipe's 16) to roughly halve the prompt length
# and the training-forward activation memory.
# * activation_checkpointing on, vllm gpu_memory_utilization 0.4 to leave
# headroom for the Megatron forward.
#
# Inherits directly from grpo_math_1B_megatron.yaml (the same base the 3B
# recipe uses) and overrides intent-specific + 7B-specific settings.
defaults: "grpo_math_1B_megatron.yaml"

grpo:
num_prompts_per_step: 32
num_generations_per_prompt: 8
max_num_steps: 1000
val_at_start: false
max_val_samples: 256
val_batch_size: 32

checkpointing:
enabled: true
checkpoint_dir: results/intent_grpo_7B_megatron
keep_top_k: 10
# save_period 20: a 1-epoch (~85-step) 7B run is slow (~6 min/step) and
# previously hit the Slurm time limit at ~step 30 with checkpoints/ still
# EMPTY. 20 lands a checkpoint at steps 20/40/60/80. checkpoint_must_save_by
# additionally forces a save once 3h45m of wall-clock have elapsed so
# progress survives the job time limit (format DD:HH:MM:SS).
save_period: 20
checkpoint_must_save_by: "00:03:45:00"

policy:
model_name: Qwen/Qwen2.5-Omni-7B
# PER-FORWARD batch must be exactly 1 sample/rank, else the Qwen2.5-Omni
# get_rope_index path crashes with "IndexError: index 1 is out of bounds for
# dimension 0 with size 1" (input_ids batch > attention_mask batch). That is
# controlled by train_micro_batch_size=1 (train forward) and
# logprob_batch_size=1 (log-prob forward). train_global_batch_size=32 only
# sets gradient accumulation and must stay divisible by micro x DP
# (32 % (1 x DP=4) == 0).
train_global_batch_size: 32
train_micro_batch_size: 1
generation_batch_size: 32
logprob_batch_size: 1
# Audio + video produces materially more tokens than the audio-only recipe;
# this budget keeps loss_multiplier > 0 with headroom. The video frame count
# (tokenizer.video.num_frames) is the dominant lever on prompt length -- do
# not raise it (or switch to fps) without raising this too.
max_total_sequence_length: 8192

tokenizer:
video:
# 7B: 8 frames (vs the 3B recipe's 16) to roughly halve the prompt length
# (~7.3k -> ~4.5k tokens: 8x360 video + ~1.5k audio + text) and thus the
# training-forward activation memory. NOTE: stopgap -- the proper fix
# (matching HumanOmniV2, which only trains the LLM) is to FREEZE the
# vision/audio encoders, which needs a code hook (no YAML knob exists).
# DO NOT switch to fps-based sampling: fps=2 expands the clips to ~43k
# video tokens, blows past max_total_sequence_length / vLLM max_model_len,
# and vlm_hf_data_processor then empties the multimodal items
# (loss_multiplier=0). fps and num_frames are mutually exclusive.
num_frames: 8

sequence_packing:
enabled: false

generation:
max_new_tokens: 1024
vllm_cfg:
# Audio/multimodal models require tokenizer to be initialized before generation
skip_tokenizer_init: False
# 7B model state crowds the GPU; lower vLLM cache budget so Megatron has
# room for activations during the training-time forward pass.
gpu_memory_utilization: 0.4
limit_mm_per_prompt:
video: 1
audio: 1
vllm_kwargs:
# Disable mm processor cache to avoid vLLM cache eviction assertion error during validation.
mm_processor_cache_gb: 0

megatron_cfg:
converter_type: Qwen2_5OmniForConditionalGeneration
apply_rope_fusion: false
activation_checkpointing: true
# TP=2 (DP=4 on 8 GPUs) -- 2x the data-parallel throughput of TP=4. Valid
# TP values are 1/2/4 (num_attention_heads=28 must be divisible by TP; TP=8
# fails). At num_frames=8 (~4.5k-token sequence) the logits/activation
# memory is ~40% smaller than at 16 frames, so TP=2 fits. If it OOMs, fall
# back to tensor_model_parallel_size=4 (proven to run at 8 frames).
tensor_model_parallel_size: 2
optimizer:
lr: 1.0e-6
min_lr: 1.0e-7
scheduler:
lr_warmup_iters: 10
lr_warmup_init: 1.0e-7
distributed_data_parallel_config:
overlap_grad_reduce: false

data:
num_workers: 0
train:
dataset_name: intent-train
split: train
allowed_problem_types:
- "multiple choice"
validation:
dataset_name: intent-bench
split: validation
allowed_problem_types:
- "multiple choice"
default:
prompt_file: null
system_prompt_file: null
processor: "vlm_hf_data_processor"
env_name: "vlm"

env:
vlm:
num_workers: 8
# Strict two-signal reward (format + accuracy), same structure as the
# HumanOmniV2 reference. The IntentDataset prompt instructs the model to
# reason between <think> </think> and commit the answer between
# <answer> </answer> tags:
# * format -- rewards the <think>...</think><answer>...</answer>
# structure (does not gate correctness).
# * exact_alnum -- case-insensitive exact match on the <answer> content;
# returns 0 when the <answer> tag is missing, so the model
# must emit the wrapped form to earn the accuracy signal.
reward_functions:
- name: format
weight: 0.2
- name: exact_alnum
weight: 0.8

logger:
wandb_enabled: true
tensorboard_enabled: true
monitor_gpus: false
wandb:
project: grpo-dev
name: intent-grpo-7b-megatron
swanlab:
project: grpo-dev
name: intent-grpo-7b-megatron

cluster:
gpus_per_node: 8
1 change: 1 addition & 0 deletions examples/prompts/daily_omni.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
{} First reason briefly between <think> </think> tags, then output only the single option letter (e.g., A, B, C, D, ...) between <answer> </answer> tags. Format example: <think>your reasoning</think><answer>A</answer>
Loading
Loading