llama: R-SWA reference sliding window attention for Unlimited-OCR#24975
Draft
sfallah wants to merge 4 commits into
Contributor
It seems the implementation in this branch has a numeric precision issue. @o7si, could you please run a test on the image in that discussion? Thanks.
Contributor
Author
@jason-ni
The DeepSeek-OCR / Unlimited-OCR decoder reads dense layout (e.g. tables) by attending over the always-visible visual prefix. With the default F16 V-cache, those value vectors are truncated enough to garble the output: table headers come out as """ / ">" (reported on ggml-org#24975), while the official HF reference parses them correctly.

The HF reference accumulates attention in F32, so match it by promoting the F16 V-cache default to F32 for LLM_ARCH_DEEPSEEK2OCR. An explicit lower-precision -ctv (e.g. q8_0) is still honored. This is the in-graph equivalent of running with --cache-type-v f32.

It is not the cuBLAS compute mode: the headers are emitted deep in autoregressive decode (a mat-vec path that bypasses cuBLAS), so FORCE_CUBLAS_COMPUTE_32F has no effect; F16 V storage/accumulation is what truncates.

Verified on the reported image (parses cleanly with no flags) and with tools/mtmd/tests/test-deepseek-ocr.py: all cases pass and improve, no regression (v1 0.2626, v2 0.6877, unlimited 0.1641 CER).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_015dykwunMpwXWxHPVhbjhiK
Overview
Scope: single-page parity.
Validation
Correctness is validated by reusing and extending the regression test that was introduced for DeepSeek-OCR (v1 and v2), which compares the output against the HF reference implementation.
Single-page, bf16, against my HF reference scoring (transformers 4.46.3):
Design note
R-SWA lives in core: LLAMA_SWA_TYPE_REFERENCE + the mask rule + per-seq n_ref in llama_kv_cache, activated only for deepseek2-ocr. Alternative: implement it in the deepseek2-ocr model code. I am open to moving it there if that is preferred.
In any case, my immediate aim is for Unlimited-OCR to be supported in llama.cpp with its original R-SWA.
How to run
GGUF models: sabafallah/Unlimited-OCR-GGUF
build/bin/llama-mtmd-cli -hf sabafallah/Unlimited-OCR-GGUF:bf16 \
    --image tools/mtmd/test-1.jpeg -p "document parsing." \
    --chat-template deepseek-ocr \
    --temp 0 --flash-attn off --no-warmup \
    -n 4096 -c 16384 \
    --dry-multiplier 0.8 --dry-base 1.75 --dry-allowed-length 2 \
    --dry-penalty-last-n -1 --dry-sequence-breaker none
Requirements