Beyond the Current Observation: Evaluating Multimodal Language Models in Non-Markov Games

RNG-Bench · Reconstructive Non-Markov Games

Shengyuan Ding1,2,3,*  ·  Xilin Wei1,*  ·  Xinyu Fang4,*  ·  Haodong Duan5,†
Dahua Lin3,5  ·  Jiaqi Wang2,†  ·  Yuhang Zang3,†

1Fudan University   2Shanghai Innovation Institute   3Shanghai AI Laboratory
4Zhejiang University   5The Chinese University of Hong Kong
*Equal contribution   †Corresponding authors


[Figure: RNG-Bench overview. Markov vs. Non-Markov, and the two-game test suite.]

Many real decisions hinge on something no longer on screen — a card seen a few turns ago, a corridor already walked. We call this the Non-Markov regime: the current observation is not a sufficient statistic, so a model must reconstruct the relevant hidden state from its history before it acts, and a single recall error changes what it sees next. RNG-Bench isolates this remember-to-act ability in closed loop, under controlled difficulty, with two complementary games.

Two complementary games

  • 🎴 Matching Pairs: static, categorical hidden state. Card identities are revealed for a single turn and must later be recalled by location.
  • 🧭 3D Maze: dynamic, spatial hidden state. Egocentric first-person views must be assembled into a map to reach the goal.

Both run under one harness and one strict parser, so a score drop reflects belief-state tracking rather than rule misunderstanding or action formatting.
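
To make the remember-to-act requirement concrete, here is a toy, text-only Matching Pairs loop. It is not the repo's environment: `play`, `_choose_with_memory`, and the simplified rules (two flips per turn, matched cards removed from play) are assumptions made for illustration. It contrasts a policy that maintains a belief state with one that acts only on the current, fully face-down board.

```python
import random


def _choose_with_memory(board, remaining, seen, rng):
    """Pick two positions using remembered identities (hypothetical policy)."""
    by_identity = {}
    for pos, ident in seen.items():
        if ident in by_identity:          # both cards of a pair are remembered
            return by_identity[ident], pos
        by_identity[ident] = pos
    unseen = sorted(remaining.difference(seen))
    first = rng.choice(unseen)
    ident = board[first]                  # revealed by the first flip
    if ident in by_identity:              # its partner was seen earlier
        return first, by_identity[ident]
    unseen.remove(first)
    return first, rng.choice(unseen)      # explore another unseen card


def play(num_pairs: int, use_memory: bool, seed: int = 0) -> int:
    """Play one toy game and return the number of turns used."""
    rng = random.Random(seed)
    board = list(range(num_pairs)) * 2
    rng.shuffle(board)
    remaining = set(range(len(board)))    # face-down positions still in play
    seen = {}                             # belief state: position -> identity
    turns = 0
    while remaining:
        turns += 1
        if use_memory:
            first, second = _choose_with_memory(board, remaining, seen, rng)
        else:
            # Memoryless: the current observation is all face-down cards.
            first, second = rng.sample(sorted(remaining), 2)
        seen[first], seen[second] = board[first], board[second]
        if board[first] == board[second]:
            remaining -= {first, second}
            seen.pop(first)
            seen.pop(second)
    return turns


if __name__ == "__main__":
    print("with memory:", play(8, use_memory=True))
    print("memoryless: ", play(8, use_memory=False))
```

In this toy setting the memory policy never needs more than 2 × num_pairs turns, while the memoryless policy needs about num_pairs² turns in expectation. The gap between them is the behavior the benchmark is designed to measure.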

Highlights

  • Remember-to-act, in closed loop. The model acts every turn on observations that have since disappeared, and a recall error reshapes the next observation — unlike memory benchmarks that ask a single post-hoc question.
  • Controlled difficulty. Independently vary hidden-state scale (grid / map size), visual pattern, and observation modality (text vs. image), with everything else held fixed.
  • Duel protocol. Two models alternate on the same board, cancelling instance variance and rewarding exploitation of opponent-revealed cards.
  • Long-context stress test. The hardest configurations reach ~128K tokens and ~350 image inputs per episode, and scale further with size.
  • Far from saturation. The best 10×10 image Matching Pairs score is 62.3% (GPT-5.4); the best 13×13 maze success rate is 50% (Gemini-3.1-Pro).

What's in this repo

.
├── model_presets.py         # Single shared model registry (both games)
├── framework/               # Shared eval harness: LLM client, runner, game, types
├── 1_matching_pairs_new/    # Matching Pairs environment + eval runner (see its README)
│   ├── env/                 # Game logic, board rendering (text / image themes)
│   ├── modes/               # single_normal / single_noaction / dual_normal / dual_noaction
│   ├── scripts/             # Example sweep launchers
│   └── assets/              # Card themes: poker, noise, textures, perlin, ...
├── 2_3d_maze/               # 3D Maze environment + eval runner (see its README)
│   ├── game.py              # DFS maze gen + raycast renderer
│   ├── runner.py            # Episode loop (binds to framework LLM client)
│   ├── run.py               # CLI
│   └── scripts/             # Example eval launcher
├── docs/                    # Project homepage (GitHub Pages)
├── .env.example             # Copy to .env; API keys / endpoints
└── requirements.txt

Both games share one model registry (model_presets.py) and one eval harness (framework/), so a model added to the registry is available in both games. The data-generation engine used for the training experiments below is not part of this release yet.


Quick Start

Install

git clone https://github.com/InternLM/RNGBench.git
cd RNGBench
conda create -n rngbench python=3.10 -y && conda activate rngbench
pip install -r requirements.txt   # openai httpx pillow numpy python-dotenv + pandas/seaborn/matplotlib (visualize)

Configure an OpenAI-compatible endpoint

Both games read API keys/endpoints from a single repo-root .env (loaded automatically by every entry point). Copy the template and fill in what you use:

cp .env.example .env

Env var                            Used by
OPENAI_API_BASE / OPENAI_API_KEY   Default OpenAI-compatible endpoint: OpenAI, or a self-hosted vLLM / lmdeploy / Ollama server (the Qwen / Kimi / gpt-5.4 presets).
GEMINI_API_KEY                     Google Gemini native API (the gemini-3.1-pro presets).
ARK_API_KEY                        Volcengine Ark / Doubao Seed (the seed-2.0 presets).
NON_MARKOV_SAMPLE_SEED             Optional reproducible sampler seed (presets opt in via sample_seed_env).

A model preset only needs whichever key its endpoint uses — you do not need all of them. The endpoints in model_presets.py are illustrative placeholders (localhost / example hosts); point them at your own deployment.

Add or change a model

All models live in the repo-root model_presets.py, shared by both games. A preset maps a short name → endpoint + sampling. The minimal form points at your default .env endpoint:

MODEL_PRESETS = {
    "my-model": {
        "model": "served-model-name",          # name your server exposes
        "api_base": "http://localhost:8000/v1", # or omit to use OPENAI_API_BASE
        "api_key_env": "OPENAI_API_KEY",        # env var holding the key
        "extra_params": {"temperature": 0.8, "max_tokens": 32768},
    },
}

Then pass --model my-model to any mode. See the existing entries for vLLM / lmdeploy, Ark, and gateway examples.
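
The sketch below shows one way a preset like the one above could be turned into concrete connection settings, including the documented fallback to OPENAI_API_BASE when api_base is omitted. `resolve_preset` is a hypothetical helper written for this README, not a function from framework/; the harness may resolve presets differently.

```python
import os

# Same shape as the minimal preset above, with api_base omitted on purpose.
MODEL_PRESETS = {
    "my-model": {
        "model": "served-model-name",
        "api_key_env": "OPENAI_API_KEY",
        "extra_params": {"temperature": 0.8, "max_tokens": 32768},
    },
}


def resolve_preset(name, presets=MODEL_PRESETS, env=os.environ):
    """Return model name, endpoint, key, and sampling params for a preset.

    Hypothetical helper: falls back to OPENAI_API_BASE when the preset has
    no api_base, and fails early if the key variable is not set.
    """
    preset = presets[name]
    api_base = preset.get("api_base") or env.get("OPENAI_API_BASE")
    key_var = preset.get("api_key_env", "OPENAI_API_KEY")
    api_key = env.get(key_var)
    if not api_base:
        raise ValueError(f"{name}: no api_base in preset and OPENAI_API_BASE is unset")
    if not api_key:
        raise ValueError(f"{name}: environment variable {key_var} is unset")
    return {
        "model": preset["model"],
        "api_base": api_base,
        "api_key": api_key,
        "extra_params": dict(preset.get("extra_params", {})),
    }
```

Passing `env` explicitly keeps the helper testable without touching the real environment, for example `resolve_preset("my-model", env={"OPENAI_API_BASE": "http://localhost:8000/v1", "OPENAI_API_KEY": "sk-local"})`.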

Run Matching Pairs

cd 1_matching_pairs_new

# Single-player, image board, 8×10 grid, noise pattern
python -m modes.single_normal \
  --model gpt-5.4 \
  --grid 8x10 --render image --theme noise \
  --seed 0 --max-resp-per-pair 5 \
  --out results_demo

# Duel: two models alternate on the same board
python -m modes.dual_normal \
  --model-a gpt-5.4 --model-b gemini-3.1-pro \
  --grid 8x10 --render-a image --render-b image --theme poker \
  --seed 0 --out results_demo

# Text mode (no images)
python -m modes.single_normal \
  --model gpt-5.4 --grid 8x10 --render text \
  --seed 0 --out results_demo

Each run writes game.json (trajectory) and images/round_*.png (per-round renders) under results_demo/<mode>/<model>/<theme>/<grid>/seed_<S>/.
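
For post-processing, the directory layout above can be reconstructed in a few lines. This is a sketch under assumptions: `run_dir` and `load_run` are hypothetical helpers, the `<mode>` component is assumed to equal the module name (for example single_normal), and nothing is assumed about the schema of game.json beyond it being valid JSON.

```python
import json
from pathlib import Path


def run_dir(out, mode, model, theme, grid, seed) -> Path:
    """Build <out>/<mode>/<model>/<theme>/<grid>/seed_<S>/ as described above."""
    return Path(out) / mode / model / theme / grid / f"seed_{seed}"


def load_run(directory: Path):
    """Return the parsed trajectory and the per-round image paths of one run."""
    trajectory = json.loads((directory / "game.json").read_text())
    # Plain lexicographic sort; adjust if round indices are not zero-padded.
    images = sorted(directory.glob("images/round_*.png"))
    return trajectory, images


if __name__ == "__main__":
    print(run_dir("results_demo", "single_normal", "gpt-5.4", "noise", "8x10", 0))
```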

Run 3D Maze

cd 2_3d_maze

# 11×11 maze, 3D first-person view
python run.py --model gpt-5.4 --maze-size 11 --seed 0

# Sweep five seeds
python run.py --model gpt-5.4 --seeds 0,1,2,3,4 --maze-size 13

# Preview without calling any LLM
python run.py --model dummy --preview --seed 0 --maze-size 11

# Memory Gap: --minimap shows the true map every step (oracle); the score drop
# vs. the normal run isolates spatial recall from perception / decision-making.
python run.py --model gpt-5.4 --maze-size 13 --seed 0 --minimap
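
The Memory Gap described in the comment above is a difference between two runs on the same seeds. A minimal sketch of that arithmetic follows, assuming the gap is taken over success rate (the paper may use a different metric); the outcome lists are made-up placeholders, not benchmark results.

```python
def success_rate(outcomes):
    """Percentage of episodes that reached the goal."""
    return 100 * sum(outcomes) / len(outcomes)


def memory_gap(oracle_outcomes, normal_outcomes):
    """Score with the true map shown (--minimap) minus score without it."""
    return success_rate(oracle_outcomes) - success_rate(normal_outcomes)


# Placeholder outcomes for five seeds; True means the goal was reached.
normal = [False, True, False, False, True]
oracle = [True, True, True, False, True]
print(memory_gap(oracle, normal))  # 40.0
```

A large gap suggests the model can perceive and plan when the map is given but fails to reconstruct it from history; a small gap points to perception or decision-making instead.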

Main Results

No frontier system is close to saturation.

Single-player. Two separate tables, one per game. Best per column in bold.

Matching Pairs (10×10, image, noise theme) — Score% = fraction of matched pairs; Resp./Score = responses per matched pair; PF/IA = parse-failure / invalid-action rates.

Model           PF%↓  IA%↓  Resp./Score↓  Score%↑
GPT-5.4         0.0   4.3   8.01          62.3
Gemini-3.1-Pro  0.4   2.5   10.00         50.0
Seed-2.0-Lite   1.2   4.3   11.57         43.2
Kimi-K2.5       1.8   2.8   13.16         38.0
Qwen3.5-397B    0.0   3.0   19.74         25.3
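
As a reading aid for the two headline columns, the sketch below computes Score% and Resp./Score for a single episode from raw counts, following the definitions in the caption. `matching_pairs_metrics` is a hypothetical helper; how the repo aggregates across seeds is not specified here, and the example counts are invented.

```python
def matching_pairs_metrics(matched_pairs: int, total_pairs: int, responses: int):
    """Return (Score%, Resp./Score) for one episode.

    Score% is the fraction of pairs matched; Resp./Score is the number of
    model responses spent per matched pair (infinite if nothing was matched).
    """
    score_pct = 100.0 * matched_pairs / total_pairs
    resp_per_score = responses / matched_pairs if matched_pairs else float("inf")
    return score_pct, resp_per_score


# A 10x10 board holds 50 pairs; the counts below are illustrative only.
print(matching_pairs_metrics(matched_pairs=30, total_pairs=50, responses=240))
# (60.0, 8.0)
```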

3D Maze (13×13, no minimap, mean optimal path 60 steps) — GS% = aggregate score (success rate, efficiency, exploration); Eff. is over successful episodes only.

Model           SR%↑  Explore%↑  Walls↓  Eff.%↑  GS%↑
GPT-5.4         20.0  32.3       3.2     75.7    30.5
Gemini-3.1-Pro  50.0  36.4       0.1     62.5    49.7
Seed-2.0-Lite   20.0  19.4       16.6    38.9    21.7
Kimi-K2.5       10.0  17.9       7.1     61.1    16.1
Qwen3.5-397B    0.0   21.0       9.9     0.0     10.5
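
The caption above quotes a mean optimal path length, and the repo tree describes game.py as DFS maze generation. The sketch below shows a textbook way to obtain both: an iterative depth-first "recursive backtracker" followed by breadth-first search for the shortest path. It is not taken from game.py; the wall/cell grid encoding, the corner start and goal, and the function names are all assumptions.

```python
import random
from collections import deque

STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def generate_maze(size: int, seed: int = 0):
    """Carve a size x size grid (1 = wall, 0 = open) with iterative DFS."""
    assert size % 2 == 1 and size >= 3, "this encoding needs an odd size"
    rng = random.Random(seed)
    grid = [[1] * size for _ in range(size)]
    grid[1][1] = 0
    stack = [(1, 1)]
    while stack:
        r, c = stack[-1]
        # Unvisited cells two steps away, staying inside the outer wall.
        options = [
            (r + 2 * dr, c + 2 * dc)
            for dr, dc in STEPS
            if 0 < r + 2 * dr < size - 1
            and 0 < c + 2 * dc < size - 1
            and grid[r + 2 * dr][c + 2 * dc] == 1
        ]
        if not options:
            stack.pop()                            # dead end: backtrack
            continue
        nr, nc = rng.choice(options)
        grid[(r + nr) // 2][(c + nc) // 2] = 0     # remove the wall between
        grid[nr][nc] = 0
        stack.append((nr, nc))
    return grid


def shortest_path_length(grid, start, goal):
    """Number of steps on the shortest open path, or None if unreachable."""
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in STEPS:
            nxt = (r + dr, c + dc)
            if grid[nxt[0]][nxt[1]] == 0 and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    maze = generate_maze(13, seed=0)
    print(shortest_path_length(maze, (1, 1), (11, 11)))
```

Because DFS carving yields a maze with exactly one route between any two cells, the BFS result is also the only loop-free path, which makes it a natural reference for an efficiency metric.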

Duel — Matching Pairs, image (poker), each model plays 16 games vs. the other four (both player orders, two seeds). The ranking diverges from single-player: Gemini-3.1-Pro wins every matchup, exploiting cards revealed by the opponent.

Model           Win%↑  W   T  L   Score%↑  ELO↑
Gemini-3.1-Pro  100.0  16  0  0   36.5     1803
GPT-5.4         50.0   7   2  7   25.3     1492
Qwen3.5-397B    46.7   7   1  8   18.0     1476
Kimi-K2.5       37.5   5   2  9   18.0     1423
Seed-2.0-Lite   15.6   2   1  13  12.3     1306
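
For readers unfamiliar with the ELO column, the standard Elo update is sketched below. The procedure actually used for the table (K-factor, initial rating, game ordering, or a fitted variant) is not stated in this README, so K = 32 and the 1500 starting rating here are assumptions, and this code is not expected to reproduce the numbers above.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b); score_a is 1 win, 0.5 tie, 0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; A wins one game.
print(elo_update(1500, 1500, 1.0))  # (1516.0, 1484.0)
```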

Key findings.

  • Performance drops sharply with scale. Qwen3.5-397B falls from 90.6% to 0.7% going from 4×4 to 12×12.
  • Vision is the bottleneck, not history length. Qwen3.5-397B and Kimi-K2.5 solve Matching Pairs perfectly in text but fall to 38.3% / 43.3% under noise-pattern images.
  • The textual action trace is load-bearing. Removing it collapses GPT-5.4 from 62.3% to 15.3% even though every flip is visible in the board image.
  • There is large headroom. An optimal policy needs only 3.24 responses per matched pair vs. 8.01 for the best model.

Full ablations are in the paper.


Sweeps

Each game ships example launchers under its scripts/ directory. They build a model × config × seed matrix and run it in parallel; everything is overridable via env vars, and PARALLEL (a thin wrapper over xargs -P) is the concurrency knob — set it to whatever your API rate limit allows.

# Matching Pairs — see 1_matching_pairs_new/scripts/
MODELS="gpt-5.4 gemini-3.1-pro" GRIDS="8x10 10x10" PARALLEL=4 \
  bash 1_matching_pairs_new/scripts/run_eval.sh        # single-player matrix
MODEL_A=gpt-5.4 MODEL_B=gemini-3.1-pro SEEDS="0 1 2 3" \
  bash 1_matching_pairs_new/scripts/run_duel.sh        # duel protocol

# 3D Maze — see 2_3d_maze/scripts/
MODELS="gpt-5.4 gemini-3.1-pro" SIZES="9 11 13" SEEDS="0 1 2 3 4" PARALLEL=4 \
  bash 2_3d_maze/scripts/run_eval.sh

See each game's README.md for the full flag reference.
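
If you prefer to drive a sweep from Python rather than the shell launchers, the same model × config × seed matrix can be built with itertools.product and capped with a thread pool. This is a sketch, not a port of run_eval.sh: `build_jobs` and `run_all` are hypothetical helpers, and only the run.py flags shown earlier in this README (--model, --maze-size, --seed) are used.

```python
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_jobs(models, sizes, seeds):
    """One argv list per (model, maze size, seed) combination."""
    return [
        ["python", "run.py", "--model", model,
         "--maze-size", str(size), "--seed", str(seed)]
        for model, size, seed in itertools.product(models, sizes, seeds)
    ]


def run_all(jobs, parallel=4, cwd="2_3d_maze"):
    """Run jobs with at most `parallel` processes in flight; return exit codes."""
    def run_one(argv):
        return subprocess.run(argv, cwd=cwd).returncode

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(run_one, jobs))


jobs = build_jobs(["gpt-5.4", "gemini-3.1-pro"], [9, 11, 13], range(5))
print(len(jobs))  # 30 = 2 models x 3 sizes x 5 seeds
```

Call `run_all(jobs, parallel=4)` from the repo root to execute the matrix. As with PARALLEL in the shell launchers, `parallel` should be set to whatever your API rate limit allows.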


Citation

@article{rngbench2026,
  title   = {Beyond the Current Observation: Evaluating Multimodal Language Models in Non-Markov Games},
  author  = {Ding, Shengyuan and Wei, Xilin and Fang, Xinyu and Duan, Haodong and
             Lin, Dahua and Wang, Jiaqi and Zang, Yuhang},
  journal = {arXiv preprint arXiv:2606.19338},
  year    = {2026},
}

License

Code is released under the MIT License (see LICENSE). The card-asset themes under 1_matching_pairs_new/assets/ retain their original licenses (see assets/<theme>/LICENSE where applicable).

Acknowledgements

We thank the maintainers of LLaMA-Factory, vLLM, and the open-weight model teams whose checkpoints we evaluated.
