PDF derivative: known limitations and follow-up work #24

Description

@tomcrane

Follow-up items noted during review of PR #23.

Font coverage

PdfBuilder currently uses Helvetica (a standard Type 1 font), which covers only Windows-1252 (Western European). ALTO content in other scripts (Arabic, CJK, Cyrillic, etc.) would need a bundled Unicode TrueType font. Easy to add: embed a suitable TTF (e.g. Liberation Sans or Noto) via PdfFontFactory.CreateFont(fontPath, PdfEncodings.IDENTITY_H, PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED). No single font covers every script, so the choice depends on which scripts need support.
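A minimal sketch of how the font swap might look, assuming iText 7 for .NET. The PdfFonts class, the LoadBodyFont method and the fontPath parameter are hypothetical names used for illustration; they are not part of PdfBuilder as described here. The fallback to Helvetica is an assumption about desired behaviour when no font is bundled.

```csharp
using System.IO;
using iText.IO.Font;
using iText.IO.Font.Constants;
using iText.Kernel.Font;

public static class PdfFonts
{
    // Hypothetical helper: returns an embedded Unicode font when a TTF is
    // available at fontPath, otherwise the current Helvetica behaviour.
    public static PdfFont LoadBodyFont(string fontPath)
    {
        if (!string.IsNullOrEmpty(fontPath) && File.Exists(fontPath))
        {
            // IDENTITY_H selects Unicode (CID) encoding; FORCE_EMBEDDED makes
            // sure the font program travels with the PDF.
            return PdfFontFactory.CreateFont(
                fontPath,
                PdfEncodings.IDENTITY_H,
                PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED);
        }

        // Assumed fallback: standard Type 1 Helvetica (Windows-1252 only).
        return PdfFontFactory.CreateFont(StandardFonts.HELVETICA);
    }
}
```

In iText a PdfFont instance belongs to the document it is first used in, so a helper like this would likely be called once per generated PDF rather than cached across documents.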

MemoryStream for generation

The whole PDF is built into a MemoryStream before being written to storage. For large manifests (300+ pages at 150 dpi) this could be tens of MB held in memory simultaneously. A better approach would write pages directly to a temp file and then copy that to storage, avoiding the in-memory spike. Low priority until large manifests are tested in practice.
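The temp-file approach could be sketched roughly as below. This uses only the .NET base class library; the TempFileSpool class and the build and upload delegates are hypothetical stand-ins for PdfBuilder and the storage client, whose real signatures are not shown in this issue.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class TempFileSpool
{
    // Hypothetical helper: 'build' writes the PDF to the stream it is given,
    // 'upload' copies the finished file to storage. Returns the file size.
    public static async Task<long> BuildAndUploadAsync(
        Action<Stream> build,
        Func<Stream, Task> upload)
    {
        string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        try
        {
            // Write straight to disk so the whole PDF is never held in memory.
            using (var output = new FileStream(
                tempPath, FileMode.CreateNew, FileAccess.Write, FileShare.None))
            {
                build(output);
            }

            // Reopen for reading and stream the finished file to storage.
            using (var input = new FileStream(
                tempPath, FileMode.Open, FileAccess.Read, FileShare.Read,
                bufferSize: 81920, useAsync: true))
            {
                long length = input.Length;
                await upload(input);
                return length;
            }
        }
        finally
        {
            // Always clean up; File.Delete does nothing if the file is absent.
            File.Delete(tempPath);
        }
    }
}
```

With iText, the build delegate would presumably wrap the stream as new PdfDocument(new PdfWriter(stream)) and add pages as it does today. Closing the PdfDocument also closes the underlying stream by default, which is harmless here because disposing a FileStream twice is safe.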

Unit tests for PdfBuilder

There are no unit tests for PdfBuilder itself. Testing a PDF builder meaningfully requires either rendering the output (complex) or inspecting the iText object graph (tightly coupled to the library internals). The most practical coverage is an E2E test that downloads the PDF and verifies: page count matches canvas count, text is extractable (e.g. via PdfPig), and the rendering link appears in the text-augmented manifest. Worth adding once the E2E suite is extended.
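The PDF-side checks of such an E2E test might look like the sketch below, assuming the PdfPig library (UglyToad.PdfPig). The PdfDerivativeChecks class and its parameters are hypothetical; downloading the PDF and checking the rendering link in the manifest are left out because they depend on the E2E harness and manifest shape, which are not described here.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;

public static class PdfDerivativeChecks
{
    // Hypothetical helper: pdfBytes is the downloaded PDF, expectedCanvasCount
    // is the number of canvases in the source manifest.
    public static void AssertLooksValid(byte[] pdfBytes, int expectedCanvasCount)
    {
        using (PdfDocument document = PdfDocument.Open(pdfBytes))
        {
            // One PDF page per canvas.
            if (document.NumberOfPages != expectedCanvasCount)
            {
                throw new InvalidOperationException(
                    $"Expected {expectedCanvasCount} pages, found {document.NumberOfPages}.");
            }

            // At least one page should yield extractable text.
            bool anyText = document.GetPages()
                .Any(page => !string.IsNullOrWhiteSpace(page.Text));
            if (!anyText)
            {
                throw new InvalidOperationException("No extractable text found in the PDF.");
            }
        }
    }
}
```

Requiring text on at least one page rather than on every page is a deliberate assumption: some canvases (blank pages, plates) may legitimately have no ALTO text.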
