Follow-up items noted during review of PR #23.
Font coverage
PdfBuilder currently uses Helvetica (a standard Type 1 font), which covers only Windows-1252 (Western European). ALTO content in other scripts (Arabic, CJK, Cyrillic, etc.) would need a bundled Unicode TrueType font. This is easy to add: embed a suitable TTF (e.g. Liberation Sans or Noto) via PdfFontFactory.CreateFont(fontPath, PdfEncodings.IDENTITY_H, EmbeddingStrategy.FORCE_EMBEDDED). No single font file covers every script, so the choice of font depends on which scripts actually appear in the ALTO content.
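A minimal sketch of the font change, assuming a recent iText 7 for .NET release in which EmbeddingStrategy is the nested enum PdfFontFactory.EmbeddingStrategy. The class name, method name, and the fontPath, outputPath and text parameters are hypothetical and are not taken from PdfBuilder.

```csharp
using iText.IO.Font;
using iText.Kernel.Font;
using iText.Kernel.Pdf;
using iText.Layout;
using iText.Layout.Element;

public static class UnicodeFontSketch
{
    // fontPath is a hypothetical location of a bundled TTF (e.g. a Noto or
    // Liberation Sans file shipped with the application).
    public static void WriteSample(string fontPath, string outputPath, string text)
    {
        // IDENTITY_H selects a two-byte composite encoding, so glyphs are not
        // limited to a single-byte code page such as Windows-1252.
        PdfFont font = PdfFontFactory.CreateFont(
            fontPath,
            PdfEncodings.IDENTITY_H,
            PdfFontFactory.EmbeddingStrategy.FORCE_EMBEDDED);

        var pdf = new PdfDocument(new PdfWriter(outputPath));
        var document = new Document(pdf);
        try
        {
            document.Add(new Paragraph(text).SetFont(font));
        }
        finally
        {
            // Closing the layout Document also closes the underlying PdfDocument.
            document.Close();
        }
    }
}
```

A PdfFont instance belongs to the PdfDocument it is first used with, so in this sketch the font is created once per generated PDF rather than cached across documents.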
MemoryStream for generation
The whole PDF is built into a MemoryStream before being written to storage. For large manifests (300+ pages at 150 dpi) this could be tens of MB held in memory simultaneously. A better approach would write pages directly to a temp file and then copy that to storage, avoiding the in-memory spike. Low priority until large manifests are tested in practice.
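The temp-file approach could look roughly like the sketch below. It uses only the .NET base class library. The writePdf delegate is a stand-in for whatever PdfBuilder does today, and destination is a stand-in for the storage upload stream; both are assumptions, as are the class and method names.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class TempFileSpool
{
    // Writes the PDF to a temp file, then streams it to storage.
    // Returns the number of bytes written.
    public static async Task<long> BuildViaTempFileAsync(
        Action<Stream> writePdf, Stream destination)
    {
        string tempPath = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());

        // DeleteOnClose removes the file when the stream is disposed,
        // including when writePdf throws.
        await using var temp = new FileStream(
            tempPath, FileMode.CreateNew, FileAccess.ReadWrite, FileShare.None,
            81920, FileOptions.DeleteOnClose);

        writePdf(temp);      // pages go to disk instead of a growing in-memory buffer
        temp.Position = 0;   // rewind before copying to storage
        await temp.CopyToAsync(destination);
        return temp.Length;
    }
}
```

The delegate must leave the stream open for the copy to work. iText's PdfWriter closes its stream by default when the document is closed, so the real builder would probably need writer.SetCloseStream(false) or an equivalent.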
Unit tests for PdfBuilder
There are no unit tests for PdfBuilder itself. Testing a PDF builder meaningfully requires either rendering the output (complex) or inspecting the iText object graph (tightly coupled to the library internals). The most practical coverage is an E2E test that downloads the PDF and verifies: page count matches canvas count, text is extractable (e.g. via PdfPig), and the rendering link appears in the text-augmented manifest. Worth adding once the E2E suite is extended.
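As a rough sketch of the PDF-side checks, assuming the E2E test has already downloaded the PDF into a byte array and knows the manifest's canvas count. The helper and parameter names are hypothetical, and no test framework is assumed, so failures are raised as plain exceptions. The rendering-link check is a manifest assertion and is not covered here.

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;

public static class PdfChecks
{
    // pdfBytes: the downloaded PDF; expectedCanvasCount: canvases in the manifest.
    public static void AssertPdfMatchesManifest(byte[] pdfBytes, int expectedCanvasCount)
    {
        using PdfDocument document = PdfDocument.Open(pdfBytes);

        // One PDF page is expected per canvas.
        if (document.NumberOfPages != expectedCanvasCount)
        {
            throw new InvalidOperationException(
                $"Expected {expectedCanvasCount} pages, found {document.NumberOfPages}.");
        }

        // At least one page should yield extractable text.
        bool hasText = document.GetPages()
            .Any(page => !string.IsNullOrWhiteSpace(page.Text));
        if (!hasText)
        {
            throw new InvalidOperationException("No extractable text found in the PDF.");
        }
    }
}
```

Requiring text on at least one page rather than on every page is a deliberate choice in this sketch, since some canvases may legitimately have no ALTO text.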