fix(cli): harden CSV formula-injection escape (sec)#119
Open
sroussey wants to merge 1 commit into
Open
Conversation
… formula injection (sec) escapeCsvValue's `^[=+\-@\t\r]/` check missed three classes of bypass: leading ASCII whitespace, U+00A0 NBSP, and dangerous chars after embedded newlines inside a quoted multi-line cell (Excel/Sheets re-parse each physical line). A SEC-supplied issuer name like ` =cmd|'/c calc'!A0` passed through unprefixed. Now strips leading whitespace (incl. NBSP) before the dangerous-lead test and defuses every line after `\n`/`\r\n` independently. Regression test matrix covers all six leading chars (incl. TAB/CR), leading space, NBSP, and LF/CRLF multi-line cases.
sroussey
added a commit
that referenced
this pull request
Jun 2, 2026
Layers two refinements on top of the line-by-line CSV defusing in #119: 1. Tighten LEADING_WS so it only strips space-like characters that spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula leads, so we no longer strip them away before the DANGEROUS_LEAD check — otherwise "\t=cmd" or "\r=cmd" would slip through as non-dangerous after the strip. 2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR inside a multi-line cell is also a line boundary; the line after the CR is independently defused. Excel re-parses every physical line of a quoted cell, including lines separated by lone CR. Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM + "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula ("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left alone. All 44 cases in TableRenderer.test.ts pass. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
sroussey
added a commit
that referenced
this pull request
Jun 3, 2026
Layers two refinements on top of the line-by-line CSV defusing in #119: 1. Tighten LEADING_WS so it only strips space-like characters that spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula leads, so we no longer strip them away before the DANGEROUS_LEAD check — otherwise "\t=cmd" or "\r=cmd" would slip through as non-dangerous after the strip. 2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR inside a multi-line cell is also a line boundary; the line after the CR is independently defused. Excel re-parses every physical line of a quoted cell, including lines separated by lone CR. Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM + "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula ("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left alone. All 44 cases in TableRenderer.test.ts pass. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
sroussey
added a commit
that referenced
this pull request
Jun 4, 2026
…SP + download leak (sec) (#121) * fix(cli): complete CSV escape hardening with unicode + bare-CR support Layers two refinements on top of the line-by-line CSV defusing in #119: 1. Tighten LEADING_WS so it only strips space-like characters that spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ, ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula leads, so we no longer strip them away before the DANGEROUS_LEAD check — otherwise "\t=cmd" or "\r=cmd" would slip through as non-dangerous after the strip. 2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR inside a multi-line cell is also a line boundary; the line after the CR is independently defused. Excel re-parses every physical line of a quoted cell, including lines separated by lone CR. Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM + "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula ("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left alone. All 44 cases in TableRenderer.test.ts pass. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc * fix(storage,task): close observation TOCTOU + bootstrap zip leak Three related correctness fixes: 1. PersonObservationRepo / CompanyObservationRepo upsertByNaturalKey had a TOCTOU window — two concurrent upserts on an empty in-memory store would both observe `size()` as 0 and assign observation_id=1 to two different natural keys. Introduce a tiny process-wide AsyncMutex in src/util/AsyncMutex.ts that serializes the INSERT branch only (the existing-row update branch is keyed by observation_id and remains lock-free). Inside the critical section we re-query the natural key before allocating an id so that a parallel upsert of the same natural key still collapses to one row and one id. 2. BootstrapDownloadTask staged the multi-GB zip into SEC_RAW_DATA_FOLDER and only removed it on the success branch. If Bun.spawn threw, or unzip exited non-zero, the archive leaked until the next bootstrap run, silently consuming disk on operator VMs. Wrap the spawn+exit-code block in try/finally and remove the zip (with `force: true`) on every exit path. 3. AsyncMutex itself: trivial queue-based serializer. Rejections in one critical section don't poison the queue because the chained tracker swallows both branches; each caller still observes its own rejection through the returned promise. Tests: - PersonObservationRepo.test.ts / CompanyObservationRepo.test.ts each gain two new cases: concurrent INSERT-on-empty must yield distinct sequential ids, and concurrent UPDATE-existing + parallel INSERT must keep the original id stable and give the insert the next id. - BootstrapDownloadTask.test.ts gains an "execute zip cleanup" describe block that stubs fetch + Bun.spawn + Bun.which and asserts the zip is removed on (a) spawn-throws, (b) unzip-exits-nonzero, and (c) the success path. All 24 affected tests pass. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc * fix(forms): stop fabricating 0 for empty exempt-offering numerics (1/5) Plan H part 1 of 5 — split across multiple commits due to the push-files tool size limit; logically one fix. See later commits for Form_1_A tests, Form_1_K, Form_1_Z, and Form_C. Adds the shared _valueHelpers module (numScalar/strScalar/numWrapped/ strWrapped) plus its unit tests. numScalar treats empty / whitespace- only / NaN / Infinity input as null and rejects thousand-separator strings; legitimate "0" round-trips so the regression guard holds. NOTE: this is an inline copy of the helpers introduced by PR #118 under src/sec/forms/insider-trading/_valueHelpers.ts. When #118 merges the duplicate should be removed in favour of a single shared module. Also switches Form_1_A.schema.ts decimal-type aliases from Type.Number() to Type.String() (4 aliases) so the storage layer can make the null-vs-zero decision per cell with numScalar() instead of Value.Convert silently producing 0 for empty text. Form_1_A.storage.ts is updated to numScalar() every decimal field in processFinancialData and the RegAOfferingHistory build, and the extractor_version is bumped 1.0.0 → 1.1.0 to force re-extraction. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc * fix(forms): stop fabricating 0 for empty exempt-offering numerics (2/5) Plan H part 2 of 5 — Form_1_A test updates. Form_1_A.test.ts: decimal-typed leaves now arrive as raw strings from the parser; assertions updated accordingly (pricePerSecurity, audit/ legal/etc. fees). New "validate financial data types and ranges" split into decimal-string vs integer-number checks. Form_1_A.storage.test.ts: three new regression tests pin the null / empty / "0" round-trip behaviour through processForm1A. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc * fix(forms): stop fabricating 0 for empty exempt-offering numerics (3/5) Plan H part 3 of 5 — Form_1_K. Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to Type.String(). Storage (processOfferingHistory): price_per_security, aggregate_offering_price, aggregate_offering_price_holders, and estimated_net_amount now flow through numScalar(). extractor_version bumped 1.0.0 → 1.1.0 with the standard re-extract comment. Tests: parsing assertions updated for the new string contract; new storage regression tests for null/empty/"0" round-trip. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc * fix(forms): stop fabricating 0 for empty exempt-offering numerics (4/5) Plan H part 4 of 5 — Form_1_Z. Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to Type.String(). Storage: processOfferingSummaries now flows price_per_security, portionSecuritiesSoldIssuer/Securityholders, and issuerNetProceeds through numScalar(). processCertificationSuspension's approxRecordHolders also goes through numScalar() now (with a null-check instead of an undefined-check so the row is dropped when the input is empty/whitespace). extractor_version 1.0.0 → 1.1.0. Tests: parsing assertions updated for string contract; new storage regression tests for null/empty/"0" round-trip. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc * fix(forms): stop fabricating 0 for empty exempt-offering numerics (5/5) Plan H part 5 of 5 — Form_C and final wrap-up. Schema: 4 DECIMAL_TYPE aliases switch from Type.Number() to Type.String() — including the previously minimum-0 DECIMAL_TYPE7_2_ NONNEGATIVE. The minimum-0 invariant is no longer expressible at the schema level; that constraint becomes the extractor's responsibility (see Form_C.storage.ts numScalar() use). Storage: - processOfferingInfo now flows price, offering_amount, and maximum_offering_amount through numScalar(); priceDeterminationMethod remains a passthrough. - processAnnualReportDisclosures now flows every disclosure field through numScalar() and drops the row when the result is null, instead of writing a fabricated 0 disclosure_value. - extractor_version 1.0.0 → 1.1.0. Tests: parsing assertions updated for the new string contract on price/offeringAmount/maximumOfferingAmount/currentEmployees/ totalAssetMostRecentFiscalYear/revenueMostRecentFiscalYear. New storage regression tests cover null/empty/"0" round-trip for both offering and disclosure numerics. Whole-suite verification (bun test) passes 845/845 with the helpers, schemas, storage, and tests all in place. https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc * fix(cli): broaden CSV LEADING_WS to cover all spreadsheet-stripped invisibles The LEADING_WS character class used to strip a handful of zero-width characters before the formula-injection check, but missed common spreadsheet-stripped invisibles (U+00AD SHY, U+034F CGJ, U+061C ALM, U+115F/U+1160 Hangul fillers, U+1680 Ogham space, U+180E MVS, the bidi formatting block U+202A..U+202E, U+205F MMSP, the invisible operators U+2061..U+2064 / U+206A..U+206F, U+3164 Hangul filler, U+FFA0 halfwidth Hangul filler). Each of those can prefix a `=`/`+`/`-`/`@` cell that Excel/Sheets/Numbers strip before formula parsing. The previous source also kept raw invisible characters inline, which editors silently normalise. The regex is now built from `\uXXXX` escape sequences so the codepoint set is reviewable. The pre-existing test "defuses leading U+00A0 NBSP before '='" passed ASCII SPACE to escapeCsvValue() — vacuous because ASCII space was already stripped. The test now uses the U+00A0 escape and we add a table-driven regression for every codepoint LEADING_WS covers, plus NBSP-prefixed variants of every DANGEROUS_LEAD codepoint. https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u * fix(task): clean up partial bulk-download on stream abort or size mismatch streamDownloadToFile previously left a half-written archive on disk if the fetch body errored mid-stream. The next bootstrap run would then hand that partial multi-GB ZIP straight to unzip and either silently produce a corrupt extract or fail in a way that masked the real cause. Three fixes: - await writer.write(value) so back-pressure is honoured. The previous code fired-and-forgot, which let bytes pile up in the writer queue. - On any error path (network abort, writer failure, size mismatch), rmSync(destPath, { force: true }) before rethrowing. Best-effort — the original error is the one callers see. - Validate `bytes === content-length` when the server advertised one. A short response is now a hard failure instead of a truncated archive. The reader lock is released before the writer is closed so any pending flush doesn't deadlock against a still-locked stream. Tests cover the mid-stream abort and content-length mismatch paths, asserting both the throw AND that the destination file is gone. https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u * fix(observation): delegate observation_id assignment to the storage backend The old INSERT path computed `(await repo.size()) + 1` for the new row's observation_id, guarded by a process-wide AsyncMutex to close the size()/put() TOCTOU window. That was wrong in two ways: 1. The mutex was only effective in a single process. Two `sec` processes — or two workers — sharing the same SQLite/Postgres backend would still collide. 2. `repo.size()` reads the live row count, so a delete elsewhere in the table would silently reassign an id to a new natural key. Replace the whole dance with `x-auto-generated: true` on the observation_id schema field. The storage backend assigns the id (SQLite INTEGER PRIMARY KEY AUTOINCREMENT, Postgres SERIAL, in-memory counter) and returns the persisted row. `put()`'s return value gives us back the assigned id without a second query. The repo-side AsyncMutex import is dropped from both observation repos. Tests previously asserted specific id values (`expect(...).toBe(1)`). Auto-generated keys do not contract for "first id is 1" across backends (Postgres SERIAL can skip on rollback), so the assertions are now `toBeGreaterThan(0)` + `.not.toBe(other)`. EntityObserver tests get the same treatment. A smoke-test against InMemoryTabularStorage confirmed the backend does return the auto-generated id from put() — the natural-key PK fallback documented in the plan was not needed. https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u * fix(resolver): serialize person/company find-or-create per natural key Two parallel `resolver.resolve(obs)` calls that mapped to the same (resolver_version, key_kind, key_value) used to race: both observed no canonical row, both minted a fresh UUID via randomUUID(), both inserted, yielding two distinct canonical ids for what should have been the same entity. Every downstream identity-link that pointed at the duplicate canonical was then silently wrong. PersonResolver and CompanyResolver now hold a process-wide Map<string, { mutex: AsyncMutex; refs: number }> keyed on the resolver_version-plus-natural-key tuple. The find-or-create critical section runs inside `mutex.lock()` so a queued caller observes the canonical row the previous caller just inserted, and returns the same id. The refcount is decremented in a finally block and the entry is deleted from the map when it hits zero, so the map stays bounded for processes that resolve millions of distinct keys. Multi-process callers (separate `sec` invocations, workers sharing a backend) still need a backend-level UNIQUE constraint to be race-free — a single-process mutex isn't visible across processes. That is noted in the class JSDoc. New PersonResolver.race.test.ts and CompanyResolver.race.test.ts cover: - two parallel resolves on the same CIK collapsing to one canonical - ditto for the name-key path (Person) and CRD + name paths (Company) - 20-25 fan-out stress for queue-depth coverage - distinct-key parallel resolves do NOT block each other https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u * fix(pr-review): address 5 Copilot review findings on PR #121 * AsyncMutex: rewrite the file docblock so it documents the current per-key resolver find-or-create use-case instead of the retired observation-repo TOCTOU window (observation_id is now auto-generated). * BootstrapDownloadTask.streamDownloadToFile: close the writer before unlinking so Windows can actually delete the partial file, and surface writer.end() failures as the operation error on the success path instead of silently returning success over a half-flushed file. * BootstrapDownloadTask.test: snapshot+restore the previous SEC_RAW_DATA_FOLDER binding around setupRawDataFolder() so the globalServiceRegistry no longer leaks a stale tmpdir into later test files. * Form_1_Z.test: replace the vacuous `expect(cik).toBe(cik)` self- comparison with assertions that the value is a 1-10 digit string, matching the CIK_TYPE schema. * Form_C.storage: restore the schema-level minimum:0 invariant on `currentEmployees` that was lost when the field moved to string-typed decimals + numScalar(). Negative values now drop to null instead of persisting as a negative disclosure_value. Adds a regression test. * fix(cli): strip U+2028/U+2029 before formula-lead check JS `\n` does not match U+2028 LINE SEPARATOR or U+2029 PARAGRAPH SEPARATOR, so escapeCsvValue never split them out as line breaks. Some spreadsheet importers silently treat them as leading whitespace before formula parsing, leaving a "<LS>=cmd" payload unguarded. Add both codepoints to LEADING_WS and pin them with regression tests. https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u --------- Co-authored-by: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardens
escapeCsvValueinsrc/cli/output/TableRenderer.tsagainst three classes of CSV formula-injection bypass that the previous^[=+\-@\t\r]/check missed. Follows OWASP CSV Injection guidance for neutralizing untrusted cell content emitted bysec query --format csvbefore a spreadsheet opens it.Bypasses now closed
=cmd|'/c calc'!A0(note the leading space) was reaching the spreadsheet unprefixed and executing."safe\n=cmd"exposed the second line as a formula even when the first line was benign.Fix
DANGEROUS_LEAD(/^[=+\-@\t\r]/) +LEADING_WS(/^[\s ]+/) constants.needsFormulaPrefix(line)strips leading WS/NBSP before the dangerous-lead test.defuseLine(line)returns"'" + linewhen dangerous.escapeCsvValuenow splits the value on/(\r?\n)/, defuses each data line independently, preserves the original separators, and only then applies RFC 4180 quoting (now also triggered by\r, not just\n).escapeCsvValueis re-exported via__testingpurely for unit tests.Test plan
New
describe("escapeCsvValue")block insrc/cli/output/TableRenderer.test.ts:=cmd,+cmd,-cmd,@cmd,\tcmd,\rcmd.=cmd→' =cmd.=cmd→' =cmd.abcand123left unchanged.safe\n=cmd→"safe\n'=cmd".safe\r\n=cmd→"safe\r\n'=cmd".,wrapped in quotes."doubled inside wrapped cell.escapeCsvValue("")returns"".renderTabletests (json/csv/table format) still pass — header/data rows, comma/quote/null handling, original 5-row formula-prefix test, benign leads, table padding/truncation/pagination.Out of scope
Plans I and J (PR #118's branch) are intentionally not touched here; the PR author for that branch will apply them.
Generated by Claude Code