llama: R-SWA reference sliding window attention for Unlimited-OCR by sfallah · Pull Request #24975 · ggml-org/llama.cpp

sfallah · 2026-06-24T14:32:21Z

Overview

Vibe-coded DRAFT to open the discussion about R-SWA support in llama.cpp.

Stacked on #24969 (the Unlimited-OCR converter) - that needs to be reviewed/merged first.

Adds baidu/Unlimited-OCR to mtmd: DeepSeek-OCR v1 with the original R-SWA decoder.
R-SWA: each token attends to the full prefix (image + prompt) plus a window over the last n_swa generated tokens; unlike plain SWA, the image stays visible.
New llama_swa_type REFERENCE; L_m latched implicitly in the KV cache at the prefill->decode boundary.
Single full KV cache (no eviction); n_ref maintained across the cache lifecycle ops.
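
To make the mask rule above concrete, here is a minimal Python sketch of R-SWA visibility next to plain SWA. It is an illustration, not the code in this PR: the function names are hypothetical, n_ref stands for the latched prefix length, and the window convention (a key is visible when q_pos - k_pos < n_swa) is an assumption.

```python
def swa_visible(q_pos: int, k_pos: int, n_swa: int) -> bool:
    """Plain sliding window: causal, and only the last n_swa positions."""
    return k_pos <= q_pos and q_pos - k_pos < n_swa


def rswa_visible(q_pos: int, k_pos: int, n_ref: int, n_swa: int) -> bool:
    """R-SWA sketch: the reference prefix [0, n_ref) is always visible,
    everything after it is subject to the sliding window."""
    if k_pos > q_pos:
        return False  # causal: never attend to future positions
    if k_pos < n_ref:
        return True   # image + prompt prefix stays visible
    return q_pos - k_pos < n_swa  # assumed window convention


def rswa_mask(n_tokens: int, n_ref: int, n_swa: int) -> list:
    """Boolean mask, rows are query positions, columns are key positions."""
    return [
        [rswa_visible(q, k, n_ref, n_swa) for k in range(n_tokens)]
        for q in range(n_tokens)
    ]
```

With n_ref = 4 and n_swa = 2, the query at position 7 sees keys 0..3 (the prefix) and 6..7 (the window), but not 4 or 5; plain SWA with the same window would see only 6..7.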

Scope: single-page parity.

Validation

Correctness is validated by reusing and extending the regression test introduced for DeepSeek-OCR (v1 and v2), which compares against the HF reference implementation.

Single-page, bf16, against my HF reference scoring (transformers 4.46.3):

Implementation	CER
llama.cpp	0.1397
HF reference	0.1869
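
For readers unfamiliar with the metric, CER (character error rate) is commonly defined as the character-level edit distance between the model output and the ground-truth text, divided by the ground-truth length. The sketch below uses that common definition; it is an assumption that the regression test computes it this way, and any text normalization it applies is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)
```

Lower is better: a CER of 0.0 means the output matches the ground truth character for character.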

Design note

R-SWA lives in core: LLAMA_SWA_TYPE_REFERENCE + the mask rule + per-seq n_ref in llama_kv_cache, activated only for deepseek2-ocr. Alternative: implement in deepseek2-ocr model code. Open to moving it into the model code if it is preferred.
In any case, my immediate aim is for Unlimited-OCR to be supported in llama.cpp with its original R-SWA.
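
As a discussion aid for the per-seq n_ref design, here is a toy Python sketch of the bookkeeping involved. Everything in it is hypothetical: the class and method names are invented for illustration, the latch is an explicit call because the PR text does not say how the prefill->decode boundary is detected, and the behavior on copy and truncation is one plausible policy, not a description of what this branch does.

```python
from typing import Dict, Optional


class RefLengthTracker:
    """Toy per-sequence n_ref bookkeeping (hypothetical, not llama.cpp code)."""

    def __init__(self) -> None:
        self._n_ref: Dict[int, int] = {}

    def latch(self, seq_id: int, n_past: int) -> None:
        # Latch once at the prefill->decode boundary; later calls are ignored.
        self._n_ref.setdefault(seq_id, n_past)

    def get(self, seq_id: int) -> Optional[int]:
        return self._n_ref.get(seq_id)

    def seq_cp(self, src: int, dst: int) -> None:
        # Assumed policy: a copied sequence inherits the source's prefix length.
        if src in self._n_ref:
            self._n_ref[dst] = self._n_ref[src]
        else:
            self._n_ref.pop(dst, None)

    def seq_rm(self, seq_id: int, p0: int) -> None:
        # Models removing cached positions >= p0. Assumed policy: if the cut
        # reaches into the reference prefix, forget the latch so that the
        # next boundary can latch again.
        n_ref = self._n_ref.get(seq_id)
        if n_ref is not None and p0 < n_ref:
            del self._n_ref[seq_id]
```

The point of the sketch is only that n_ref is extra state with its own invariants, which is the trade-off behind keeping it in llama_kv_cache versus in the model code.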

How to run

GGUF models: sabafallah/Unlimited-OCR-GGUF

build/bin/llama-mtmd-cli -hf sabafallah/Unlimited-OCR-GGUF:bf16 \
  --image tools/mtmd/test-1.jpeg -p "document parsing." \
  --chat-template deepseek-ocr \
  --temp 0 --flash-attn off --no-warmup \
  -n 4096 -c 16384 \
  --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 \
  --dry-penalty-last-n -1 --dry-sequence-breaker none

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - this is largely an AI-written first draft, opened to start the discussion. I designed the R-SWA approach (from the model's paper), directed and reviewed the implementation; the code itself is mostly AI-generated. I take responsibility for the contents.

jason-ni · 2026-06-27T02:28:30Z

It seems the implementation in this branch has a numeric precision issue:
https://huggingface.co/sabafallah/Unlimited-OCR-GGUF/discussions/1

@o7si could you please test the image in that discussion? Thanks.

sfallah · 2026-06-27T07:26:28Z

@jason-ni
yes, I will.
Thanks for reporting this.

The DeepSeek-OCR / Unlimited-OCR decoder reads dense layout (e.g. tables) by attending over the always-visible visual prefix. With the default F16 V-cache, those value vectors are truncated enough to garble the output: table headers come out as """ / ">" (reported on ggml-org#24975), while the official HF reference parses them correctly.

The HF reference accumulates attention in F32, so match it by promoting the F16 V-cache default to F32 for LLM_ARCH_DEEPSEEK2OCR. An explicit lower-precision -ctv (e.g. q8_0) is still honored. This is the in-graph equivalent of running with --cache-type-v f32.

It is not the cuBLAS compute mode: the headers are emitted deep in autoregressive decode (a mat-vec path that bypasses cuBLAS), so FORCE_CUBLAS_COMPUTE_32F has no effect; F16 V storage/accumulation is what truncates.

Verified on the reported image (parses cleanly with no flags) and with tools/mtmd/tests/test-deepseek-ocr.py: all cases pass and improve, no regression (v1 0.2626, v2 0.6877, unlimited 0.1641 CER).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015dykwunMpwXWxHPVhbjhiK
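
The kind of rounding described above can be illustrated with a small Python sketch that stores values in half precision and rounds the accumulator after every step. It is a toy with made-up numbers, not the ggml kernel: the helper names are hypothetical, and the double-precision path merely stands in for a higher-precision reference.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def weighted_sum(weights, values, store=lambda x: x, acc_round=lambda x: x):
    """Attention-style weighted sum of values.

    store     models the precision the values are cached in,
    acc_round models the precision of the running accumulator.
    """
    acc = 0.0
    for w, v in zip(weights, values):
        acc = acc_round(acc + w * store(v))
    return acc


weights = [0.5, 0.3, 0.2]  # toy attention weights, sum to 1
values = [0.1, 0.2, 0.3]   # toy value entries

ref = weighted_sum(weights, values)  # full double precision
out_f16 = weighted_sum(weights, values, store=to_f16, acc_round=to_f16)
err = abs(out_f16 - ref)             # small but nonzero
```

The error is tiny for three terms; the commit message's claim is that over a long visual prefix such rounding is enough to change the decoded text.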

sfallah added 3 commits June 24, 2026 12:03

mtmd: model: unlimited-ocr: converter + parity test

381c3d2

deepseek2-ocr: R-SWA reference sliding window attention

823dfaa

deepseek2-ocr: maintain R-SWA n_ref across kv-cache lifecycle ops

cc6e8e0

github-actions bot added the model, examples, and python labels on Jun 24, 2026

o7si mentioned this pull request Jun 25, 2026

Feature Request: Supports Unlimited-OCR #25009

Closed

4 tasks

github-actions bot added the mtmd and conversion labels on Jun 27, 2026

sfallah wants to merge 4 commits into ggml-org:master from sfallah:sf/unlimited-ocr-rswa
