130 changes: 117 additions & 13 deletions openpdf-renderer/README.md
@@ -95,22 +95,98 @@ in-tree legacy parser (`PDFFile`, `PDFPage`, `PDFParser`,
| Content-stream operator listing (`getContentOperators`) | `openpdf-core` (`PdfContentParser`) |
| Page rasterization (`renderPage`) | `openpdf-core` (`PdfContentParser`) → Java2D via `OpenPdfCorePageRenderer` |

The Java2D rasterizer (`OpenPdfCorePageRenderer`) supports the standard subset
of PDF operators needed for typical text + simple-vector PDFs: graphics state
(`q`/`Q`/`cm`), path construction (`m`/`l`/`c`/`v`/`y`/`re`/`h`), path
painting (`S`/`s`/`f`/`f*`/`B`/`B*`/`b`/`b*`/`n`), line width (`w`),
DeviceGray/DeviceRGB colors (`g`/`G`/`rg`/`RG`), and the full text-object
machinery (`BT`/`ET`/`Tf`/`Tc`/`Tw`/`TL`/`Tz`/`Td`/`TD`/`Tm`/`T*`/`Tj`/`TJ`/`'`/`"`).
Operators outside this subset (extended graphics state `gs`, CMYK / pattern /
shading colors, XObject `Do`, inline images, marked content, clipping
`W`/`W*`, ...) are parsed but currently ignored — pages that rely heavily on
them may render with missing content. Adding more operators is a localized
change in `OpenPdfCorePageRenderer`.

The Java2D rasterizer (`OpenPdfCorePageRenderer`) supports a broad subset of
PDF content-stream operators — sufficient for typical text + vector PDFs:

| Category | Operators |
|---|---|
| Graphics state | `q`, `Q`, `cm`, `gs` (alpha `CA`/`ca`, line styling `LW`/`ML`/`LC`/`LJ`/`D`, stroke-adjust `SA`) |
| Line style | `w` (including the PDF §8.4.3.2 zero-width hairline rule), `J`, `j`, `M`, `d`, `i` |
| Path construction | `m`, `l`, `c`, `v`, `y`, `re`, `h` |
| Path painting | `S`, `s`, `f`, `F`, `f*`, `B`, `B*`, `b`, `b*`, `n` |
| Clipping | `W`, `W*` |
| Colors (DeviceGray / DeviceRGB / DeviceCMYK) | `g`, `G`, `rg`, `RG`, `k`, `K`, `cs`, `CS`, `sc`, `SC`, `scn`, `SCN` |
| Text state | `BT`, `ET`, `Tf`, `Tc`, `Tw`, `TL`, `Tz`, `Td`, `TD`, `Tm`, `T*`, `Ts` |
| Text showing | `Tj`, `TJ`, `'`, `"` |
| XObjects | `Do` (see below) |
| Marked content / compatibility (no-op) | `BMC`, `BDC`, `EMC`, `MP`, `DP`, `BX`, `EX` |
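
As a rough illustration of how the graphics-state operators in the table interact, the sketch below tracks the current transformation matrix across `q`, `Q` and `cm` using Java2D's `AffineTransform`. The class `CtmStackSketch` and its method names are hypothetical and are not part of `OpenPdfCorePageRenderer`; the only assumption about PDF itself is that the six `cm` operands `a b c d e f` map onto `AffineTransform`'s constructor in the same order and that the new matrix applies to points before the existing CTM.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper (not OpenPDF API): tracks the CTM across q / Q / cm. */
final class CtmStackSketch {
    private final Deque<AffineTransform> saved = new ArrayDeque<>();
    private AffineTransform ctm = new AffineTransform(); // starts as identity

    /** q: push a copy of the current transform. */
    void save() {
        saved.push(new AffineTransform(ctm));
    }

    /** Q: pop the most recently saved transform; an unbalanced Q is ignored. */
    void restore() {
        if (!saved.isEmpty()) {
            ctm = saved.pop();
        }
    }

    /** a b c d e f cm: the new matrix is applied to points before the old CTM. */
    void concat(double a, double b, double c, double d, double e, double f) {
        ctm.concatenate(new AffineTransform(a, b, c, d, e, f));
    }

    /** Maps a user-space point to device space under the current CTM. */
    Point2D toDevice(double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }
}
```

Copying the transform on `q` rather than sharing it keeps a later `cm` from leaking into the saved state, which is the property `Q` relies on.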

XObject coverage:

- Form XObjects render recursively, applying their own `/Matrix` and `/BBox`
under the current CTM with full state save/restore.
- Image XObjects decode via `ImageIO` for JPEG (`DCTDecode`) and JPEG 2000
(`JPXDecode`, where the runtime supports it), and via a manual raster
builder for uncompressed / Flate-decoded 8-bit DeviceGray, DeviceRGB and
DeviceCMYK streams (CMYK approximated to sRGB on the fly). 8-bit Indexed
color images are expanded through their palette into the base color space
(DeviceGray / DeviceRGB / DeviceCMYK).
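
The on-the-fly CMYK approximation can be pictured with the textbook `(1-c)(1-k)` formula that the limitations section names. The following is a minimal sketch under that assumption only; `CmykApproxSketch` and its methods are illustrative names, and the renderer's actual code may differ in rounding or sample layout.

```java
/** Hypothetical sketch of the naive CMYK to RGB mapping, with no ICC profile. */
final class CmykApproxSketch {

    /** Converts one CMYK color (components in 0..1) to a packed 0xRRGGBB value. */
    static int cmykToRgb(float c, float m, float y, float k) {
        int r = Math.round(255f * (1f - c) * (1f - k));
        int g = Math.round(255f * (1f - m) * (1f - k));
        int b = Math.round(255f * (1f - y) * (1f - k));
        return (r << 16) | (g << 8) | b;
    }

    /** Converts interleaved 8-bit CMYK samples (C, M, Y, K, C, M, ...) to packed RGB pixels. */
    static int[] convert(byte[] cmyk) {
        int[] rgb = new int[cmyk.length / 4];
        for (int i = 0; i < rgb.length; i++) {
            rgb[i] = cmykToRgb(
                    (cmyk[4 * i] & 0xFF) / 255f,
                    (cmyk[4 * i + 1] & 0xFF) / 255f,
                    (cmyk[4 * i + 2] & 0xFF) / 255f,
                    (cmyk[4 * i + 3] & 0xFF) / 255f);
        }
        return rgb;
    }
}
```

Because no profile is involved, saturated inks map to fully saturated RGB primaries, which is why color-managed documents can look off with this approach.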

Text rendering: for each `Tf`-selected font, the renderer pulls the
embedded font program (`FontFile2`/`FontFile3`/`FontFile`) out of the
FontDescriptor and loads it via `java.awt.Font.createFont`. Embedded
TrueType fonts therefore render with their own glyph shapes. When a
font isn't embedded (or the embedded program can't be loaded), the
renderer falls back to a generic Java2D family picked by PostScript-name
heuristics — glyph widths from the PDF font are still respected,
but shapes are only approximate.
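
The load-or-fall-back flow described above might look roughly like the sketch below. `Font.createFont` is the real Java2D call; everything else (the class `FontFallbackSketch`, the specific name heuristics, and the choice of logical families) is a hypothetical stand-in for whatever `OpenPdfCorePageRenderer` actually does. The one PDF fact it leans on is that subsetted fonts carry a six-uppercase-letter tag followed by `+` in front of the PostScript name.

```java
import java.awt.Font;
import java.awt.FontFormatException;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Locale;

/** Hypothetical sketch: try the embedded font program, else guess a Java2D family. */
final class FontFallbackSketch {

    /** Drops a subset tag such as "ABCDEF+" from a PostScript font name. */
    static String stripSubsetTag(String name) {
        return name.matches("[A-Z]{6}\\+.*") ? name.substring(7) : name;
    }

    /** Illustrative heuristic only: picks a Java logical family from the name. */
    static String fallbackFamily(String postScriptName) {
        String n = stripSubsetTag(postScriptName).toLowerCase(Locale.ROOT);
        if (n.contains("courier") || n.contains("mono")) {
            return Font.MONOSPACED;
        }
        if (n.contains("times") || n.contains("roman")) {
            return Font.SERIF;
        }
        return Font.SANS_SERIF;
    }

    /** Illustrative heuristic only: derives bold / italic flags from the name. */
    static int fallbackStyle(String postScriptName) {
        String n = postScriptName.toLowerCase(Locale.ROOT);
        int style = Font.PLAIN;
        if (n.contains("bold")) {
            style |= Font.BOLD;
        }
        if (n.contains("italic") || n.contains("oblique")) {
            style |= Font.ITALIC;
        }
        return style;
    }

    /** fontProgram may be null when the PDF does not embed the font. */
    static Font loadOrFallback(byte[] fontProgram, String postScriptName, float size) {
        if (fontProgram != null) {
            try {
                return Font.createFont(Font.TRUETYPE_FONT, new ByteArrayInputStream(fontProgram))
                        .deriveFont(size);
            } catch (FontFormatException | IOException e) {
                // Embedded program could not be loaded; fall through to the heuristic.
            }
        }
        return new Font(fallbackFamily(postScriptName), fallbackStyle(postScriptName), 1)
                .deriveFont(size);
    }
}
```

In a fallback like this, the glyph advances would still come from the PDF's width array rather than from the substituted font, which is what keeps line layout stable even when shapes differ.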

Tables: `OpenPdfCorePageRenderer` honors the PDF §8.4.3.2 zero-width hairline
rule (`0 w` strokes are rendered as one device pixel rather than collapsing to
nothing under the page CTM), reads dash patterns and the stroke-adjust flag
from ExtGState (`D`, `SA`), and enables Java2D `KEY_STROKE_CONTROL =
VALUE_STROKE_NORMALIZE` so that 0.5pt table borders snap to integer device
pixels instead of smearing across two rows of antialiased pixels. Full
`PdfPTable` output (cell-background fills, colored borders, header rows and
cell text) is exercised by the renderer's test suite.
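
One way to get hairline behavior in plain Java2D, shown here as a sketch rather than as the renderer's actual implementation, is to rely on `BasicStroke`'s documented handling of a zero width (the thinnest line the device can draw) together with the `KEY_STROKE_CONTROL` hint mentioned above. The class `HairlineSketch`, the 100x100 canvas and the 0.5 scale are assumptions made for the demonstration.

```java
import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.Line2D;
import java.awt.image.BufferedImage;

/** Hypothetical demo: a zero-width stroke survives a shrinking transform. */
final class HairlineSketch {

    /** Draws a zero-width horizontal rule and counts inked pixels in the middle column. */
    static int inkedPixelsInMiddleColumn() {
        BufferedImage img = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, 100, 100);
            g.setRenderingHint(RenderingHints.KEY_STROKE_CONTROL,
                    RenderingHints.VALUE_STROKE_NORMALIZE);
            g.scale(0.5, 0.5);                 // stands in for a page CTM that shrinks content
            g.setStroke(new BasicStroke(0f));  // width 0: thinnest line the device can draw
            g.setColor(Color.BLACK);
            g.draw(new Line2D.Double(0, 100, 200, 100));
        } finally {
            g.dispose();
        }
        int inked = 0;
        for (int y = 0; y < 100; y++) {
            if ((img.getRGB(50, y) & 0xFFFFFF) != 0xFFFFFF) {
                inked++;
            }
        }
        return inked;
    }
}
```

The point of the demonstration is that the rule stays visible even though its nominal width, multiplied by the transform's scale, would otherwise round to nothing.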

Inline images (`BI`/`ID`/`EI`) are rendered: a preprocessing pass promotes
each inline image into a synthetic Image XObject, and the rest of the
renderer then treats it like any other XObject. When the filter is
`DCTDecode`, the end of the image data is located by the JPEG `FFD9`
end-of-image marker, which sidesteps the ambiguous whitespace-bounded `EI`
heuristic. Uncompressed, Flate-decoded and JPEG inline images are supported.

Shading (`sh`), pattern / shading colors and Type 3 font glyph operators are
silently ignored. Pages that rely heavily on those features may render with
missing content. Adding more operators is a localized change in
`OpenPdfCorePageRenderer`.
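
The `FFD9` framing used for `DCTDecode` inline images can be sketched as a plain byte scan. This is a deliberately simplified illustration with a hypothetical class name, not the renderer's code: it returns the offset just past the first `FF D9` pair, and it assumes the JPEG carries no embedded thumbnail whose own end-of-image marker would appear earlier. A more careful version would walk the JPEG segment structure instead.

```java
/** Hypothetical sketch: find where an inline JPEG ends inside a content stream. */
final class InlineJpegSketch {

    /**
     * Returns the offset just past the first FF D9 marker at or after start,
     * or -1 if the marker is not present.
     */
    static int endOfJpeg(byte[] data, int start) {
        for (int i = Math.max(start, 0); i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xFF && (data[i + 1] & 0xFF) == 0xD9) {
                return i + 2;
            }
        }
        return -1;
    }
}
```

Scanning for the marker works on entropy-coded data because JPEG escapes a literal `FF` byte as `FF 00` there, so `FF D9` cannot occur by accident inside the compressed scan.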

If a page needs features outside this supported subset and you want
pixel-perfect output today, the deprecated `PDFFile` / `PDFPage.getImage(...)`
API still works.

### Honest limitations & roadmap

`OpenPdfCoreRenderer` is intentionally a focused, lightweight renderer.
The legacy in-tree parser still wins on real-world PDFs that exercise:

- **Embedded Type 1 / CFF / OpenType-CFF fonts.** `Font.createFont` only
loads TrueType reliably; `FontFile3` (CFF/OpenType) is attempted but
often falls back to the name-heuristic path. Subsetted TrueType fonts
with non-Unicode CMaps draw `.notdef` for codes their `cmap` table
doesn't list. Real fix: map character codes to glyph IDs through the PDF's
encoding / CMap and render via `Font#createGlyphVector(FontRenderContext, int[])`.
- **Type 3 fonts.** Glyph operators (`d0`, `d1` + nested content streams)
are ignored.
- **Color management.** CMYK uses the textbook `(1-c)(1-k)` approximation;
no ICC profile, no UCR/BG. Anything color-managed will look noticeably
wrong. Real fix: respect the ICCBased profile via `java.awt.color.ICC_Profile`.
- **Pattern and shading paint** (`Pattern` color space, `sh`). Ignored.
- **Soft masks (`SMask`) and transparency groups.** Ignored; image alpha
honors `ca` only, not per-pixel masks.
- **Separation / DeviceN color spaces** for images and paths. Ignored; falls
back to filling with the color-space default. (Indexed images are now
supported.)
- **Sub-byte bit depths** (1/2/4-bit indexed images, 1-bit image masks).
Currently only 8-bit indices are decoded.
- **Encrypted PDFs.** Out of scope for this module (see "Encryption: removed"
below).
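
For the sub-byte bit depth item in the list above, the missing step is mostly bit unpacking. The sketch below expands 1-, 2- or 4-bit samples into one byte per sample, relying on two properties of PDF image data: samples are packed high-order bit first, and every row starts on a byte boundary. The class and method names are hypothetical, and this is one possible shape for the addition, not a description of planned code.

```java
/** Hypothetical sketch: expand packed 1/2/4-bit image samples to one byte each. */
final class SubByteUnpackSketch {

    static byte[] unpack(byte[] packed, int width, int height, int bits) {
        if (bits != 1 && bits != 2 && bits != 4) {
            throw new IllegalArgumentException("bits must be 1, 2 or 4");
        }
        int rowBytes = (width * bits + 7) / 8; // each row is padded to a whole byte
        int mask = (1 << bits) - 1;
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int bitPos = x * bits;
                int b = packed[y * rowBytes + bitPos / 8] & 0xFF;
                int shift = 8 - bits - (bitPos % 8); // high-order bits hold the first sample
                out[y * width + x] = (byte) ((b >> shift) & mask);
            }
        }
        return out;
    }
}
```

The unpacked bytes could then go through the same palette expansion that 8-bit Indexed images already use.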

These gaps are why the legacy `PDFFile` / `PDFPage` path remains the
production renderer for the time being. Each item above is a fairly
localized addition to `OpenPdfCorePageRenderer`; the order above is
roughly highest-impact first.

## Quick Start

### Basic PDF to Image Conversion
@@ -155,6 +231,34 @@ try (OpenPdfCoreRenderer renderer = new OpenPdfCoreRenderer(new File("document.p
}
```

### Rendering directly to a `Graphics2D`

Avoid the intermediate `BufferedImage` when the caller already has a target
surface (Swing component, printer, SVG-backed graphics, ...):

```java
try (OpenPdfCoreRenderer renderer = new OpenPdfCoreRenderer(new File("document.pdf"))) {
    // Any Graphics2D works; a BufferedImage stands in for the caller's surface here.
    BufferedImage out = new BufferedImage(800, 1000, BufferedImage.TYPE_INT_ARGB);
    Graphics2D g2 = out.createGraphics();
    try {
        renderer.renderPage(1, g2, 800, 1000); // fit page to the box, preserve aspect
    } finally {
        g2.dispose();
    }
}
```

### Batch rendering

```java
try (OpenPdfCoreRenderer renderer = new OpenPdfCoreRenderer(new File("document.pdf"))) {
    List<BufferedImage> pages = renderer.renderAllPages(150f);
    for (int i = 0; i < pages.size(); i++) {
        ImageIO.write(pages.get(i), "png", new File("page-" + (i + 1) + ".png"));
    }
}
```

## Using the legacy `PDFFile` / `PDFPage` API (deprecated)

The pre-3.0.5 entry point still works but is now `@Deprecated`. New code should