
llama: R-SWA reference sliding window attention for Unlimited-OCR #24975

Draft
sfallah wants to merge 4 commits into ggml-org:master from sfallah:sf/unlimited-ocr-rswa

sfallah commented Jun 24, 2026

Overview

Vibe-coded DRAFT to open the discussion about R-SWA support in llama.cpp.

Stacked on #24969 (the Unlimited-OCR converter) - that needs to be reviewed/merged first.

  • Adds baidu/Unlimited-OCR to mtmd: DeepSeek-OCR v1 with the original R-SWA decoder.
  • R-SWA: each token attends to the full prefix (image + prompt) plus a window over the last n_swa generated tokens; unlike plain SWA, the image stays visible.
  • New llama_swa_type REFERENCE; L_m latched implicitly in the KV cache at the prefill->decode boundary.
  • Single full KV cache (no eviction); n_ref maintained across the cache lifecycle ops.
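
The mask rule described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the code in this branch: the function names are hypothetical, n_ref stands for the latched prefix length, and the exact window convention (here the window includes the current token and is measured in absolute positions) is an assumption.

```python
def rswa_visible(q_pos: int, k_pos: int, n_ref: int, n_swa: int) -> bool:
    """Return True if the query at q_pos may attend to the key at k_pos.

    Hypothetical helper: n_ref is the length of the reference prefix
    (image + prompt), n_swa is the sliding-window size.
    """
    if k_pos > q_pos:
        return False              # causal: never look ahead
    if k_pos < n_ref:
        return True               # the reference prefix stays visible
    return q_pos - k_pos < n_swa  # assumed convention: window includes q_pos


def swa_visible(q_pos: int, k_pos: int, n_swa: int) -> bool:
    """Plain sliding-window rule, shown only for comparison."""
    return k_pos <= q_pos and q_pos - k_pos < n_swa


# Prefix of 4 tokens, window of 2, query at position 8.
print([k for k in range(9) if rswa_visible(8, k, n_ref=4, n_swa=2)])  # [0, 1, 2, 3, 7, 8]
print([k for k in range(9) if swa_visible(8, k, n_swa=2)])            # [7, 8]
```

With plain SWA the prefix positions 0..3 fall out of the window once decoding moves far enough, which is the behavior R-SWA avoids.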

Scope: single-page parity.

Validation

Correctness is validated by reusing and extending the regression test that was introduced for DeepSeek-OCR (v1 and v2), which compares against the HF reference implementation.

Single-page, bf16, against my HF reference scoring (transformers 4.46.3):

                CER
llama.cpp       0.1397
HF reference    0.1869
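
For readers unfamiliar with the metric: CER (character error rate) is conventionally the character-level edit distance between the output and the ground truth, divided by the ground-truth length, so lower is better. The sketch below shows that conventional definition only; the function names are hypothetical, and any text normalization the actual regression test applies is not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


print(cer("abxd", "abcd"))  # 0.25
```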

Design note

R-SWA lives in core: LLAMA_SWA_TYPE_REFERENCE, the mask rule, and a per-sequence n_ref in llama_kv_cache, activated only for deepseek2-ocr. The alternative is to implement it in the deepseek2-ocr model code; I am open to moving it there if that is preferred.
Either way, my immediate aim is for llama.cpp to support Unlimited-OCR with its original R-SWA.

How to run

GGUF models: sabafallah/Unlimited-OCR-GGUF

build/bin/llama-mtmd-cli -hf sabafallah/Unlimited-OCR-GGUF:bf16 \
  --image tools/mtmd/test-1.jpeg -p "document parsing." \
  --chat-template deepseek-ocr \
  --temp 0 --flash-attn off --no-warmup \
  -n 4096 -c 16384 \
  --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 \
  --dry-penalty-last-n -1 --dry-sequence-breaker none

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - this is largely an AI-written first draft, opened to start the discussion. I designed the R-SWA approach (from the model's paper), directed and reviewed the implementation; the code itself is mostly AI-generated. I take responsibility for the contents.

@github-actions (bot) added the model, examples, and python labels on Jun 24, 2026
@jason-ni

It seems the implementation in this branch has a numeric precision issue:
https://huggingface.co/sabafallah/Unlimited-OCR-GGUF/discussions/1

@o7si could you please test the image from that discussion? Thanks.

sfallah commented Jun 27, 2026

> It seems the impl in this branch has numeric precision issue: https://huggingface.co/sabafallah/Unlimited-OCR-GGUF/discussions/1
>
> @o7si could you please have a test on image in that discussion? Thanks.

@jason-ni
Yes, I will.
Thanks for reporting this.

The DeepSeek-OCR / Unlimited-OCR decoder reads dense layout (e.g. tables)
by attending over the always-visible visual prefix. With the default F16
V-cache, those value vectors are truncated enough to garble the output:
table headers come out as """ / ">" (reported on ggml-org#24975), while the
official HF reference parses them correctly.

The HF reference accumulates attention in F32, so match it by promoting the
F16 V-cache default to F32 for LLM_ARCH_DEEPSEEK2OCR. An explicit
lower-precision -ctv (e.g. q8_0) is still honored. This is the in-graph
equivalent of running with --cache-type-v f32. It is not the cuBLAS compute
mode: the headers are emitted deep in autoregressive decode (a mat-vec path
that bypasses cuBLAS), so FORCE_CUBLAS_COMPUTE_32F has no effect; F16 V
storage/accumulation is what truncates.
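
The storage side of this effect can be illustrated in isolation. The sketch below is not llama.cpp code and does not model its kernels: it rounds values to IEEE half or single precision with Python's struct module and then takes an attention-style weighted sum in double precision. The helper names and the sample values are hypothetical and chosen only to make the rounding visible.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to a storage format: 'e' = IEEE half (F16), 'f' = IEEE single (F32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def weighted_sum(weights, values, fmt: str) -> float:
    """Weighted sum over values that were first stored in the given format."""
    return sum(w * round_to(fmt, v) for w, v in zip(weights, values))


weights = [0.5, 0.5]
values = [1000.1, 1000.2]  # illustrative magnitudes, not taken from the model
exact = sum(w * v for w, v in zip(weights, values))

print(round_to("e", 1000.1))  # 1000.0 (F16 spacing near 1000 is 0.5)
print(abs(weighted_sum(weights, values, "e") - exact))  # error with F16 storage
print(abs(weighted_sum(weights, values, "f") - exact))  # error with F32 storage
```

F16 keeps 11 significant bits against 24 for F32, so the stored values lose detail before the weights are applied at all.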

Verified on the reported image (parses cleanly with no flags) and with
tools/mtmd/tests/test-deepseek-ocr.py: all cases pass and improve, no
regression (v1 0.2626, v2 0.6877, unlimited 0.1641 CER).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015dykwunMpwXWxHPVhbjhiK
@github-actions (bot) added the mtmd and conversion labels on Jun 27, 2026