Skip to content

fix(cli): harden CSV formula-injection escape (sec)#119

Open
sroussey wants to merge 1 commit into
mainfrom
claude/wonderful-hypatia-Y5xRl
Open

fix(cli): harden CSV formula-injection escape (sec)#119
sroussey wants to merge 1 commit into
mainfrom
claude/wonderful-hypatia-Y5xRl

Conversation

@sroussey
Copy link
Copy Markdown
Contributor

Summary

Hardens escapeCsvValue in src/cli/output/TableRenderer.ts against three classes of CSV formula-injection bypass that the previous ^[=+\-@\t\r]/ check missed. Follows OWASP CSV Injection guidance for neutralizing untrusted cell content emitted by sec query --format csv before a spreadsheet opens it.

Bypasses now closed

  1. Leading ASCII whitespace — Excel/Sheets/Numbers strip leading whitespace before parsing a cell as a formula, so =cmd|'/c calc'!A0 (note the leading space) was reaching the spreadsheet unprefixed and executing.
  2. Leading U+00A0 NBSP — Same as above; NBSP is stripped by the parsers and was not in the previous whitespace check.
  3. Dangerous chars after embedded newlines in a multi-line cell — Quoted multi-line cells are re-parsed per physical line, so "safe\n=cmd" exposed the second line as a formula even when the first line was benign.

Fix

  • New DANGEROUS_LEAD (/^[=+\-@\t\r]/) + LEADING_WS (/^[\s ]+/) constants.
  • needsFormulaPrefix(line) strips leading WS/NBSP before the dangerous-lead test.
  • defuseLine(line) returns "'" + line when dangerous.
  • escapeCsvValue now splits the value on /(\r?\n)/, defuses each data line independently, preserves the original separators, and only then applies RFC 4180 quoting (now also triggered by \r, not just \n).
  • Public API is unchanged; escapeCsvValue is re-exported via __testing purely for unit tests.

Test plan

New describe("escapeCsvValue") block in src/cli/output/TableRenderer.test.ts:

  • All six dangerous leads prefixed: =cmd, +cmd, -cmd, @cmd, \tcmd, \rcmd.
  • Leading ASCII space + =cmd' =cmd.
  • Leading NBSP + =cmd' =cmd.
  • Plain abc and 123 left unchanged.
  • Multi-line LF safe\n=cmd"safe\n'=cmd".
  • Multi-line CRLF safe\r\n=cmd"safe\r\n'=cmd".
  • Multi-line with multiple dangerous lines interleaved with safe ones — every dangerous line defused, safe ones untouched.
  • Cell with , wrapped in quotes.
  • Embedded " doubled inside wrapped cell.
  • escapeCsvValue("") returns "".
  • All existing renderTable tests (json/csv/table format) still pass — header/data rows, comma/quote/null handling, original 5-row formula-prefix test, benign leads, table padding/truncation/pagination.

Out of scope

Plans I and J (PR #118's branch) are intentionally not touched here; the PR author for that branch will apply them.


Generated by Claude Code

… formula injection (sec)

escapeCsvValue's `^[=+\-@\t\r]/` check missed three classes of bypass:
leading ASCII whitespace, U+00A0 NBSP, and dangerous chars after embedded
newlines inside a quoted multi-line cell (Excel/Sheets re-parse each
physical line). A SEC-supplied issuer name like ` =cmd|'/c calc'!A0` passed
through unprefixed.

Now strips leading whitespace (incl. NBSP) before the dangerous-lead test
and defuses every line after `\n`/`\r\n` independently. Regression test
matrix covers all six leading chars (incl. TAB/CR), leading space, NBSP,
and LF/CRLF multi-line cases.
sroussey added a commit that referenced this pull request Jun 2, 2026
Layers two refinements on top of the line-by-line CSV defusing in #119:

1. Tighten LEADING_WS so it only strips space-like characters that
   spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ,
   ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula
   leads, so we no longer strip them away before the DANGEROUS_LEAD
   check — otherwise "\t=cmd" or "\r=cmd" would slip through as
   non-dangerous after the strip.

2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR
   inside a multi-line cell is also a line boundary; the line after
   the CR is independently defused. Excel re-parses every physical
   line of a quoted cell, including lines separated by lone CR.

Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM
+ "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula
("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left
alone. All 44 cases in TableRenderer.test.ts pass.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
sroussey added a commit that referenced this pull request Jun 3, 2026
Layers two refinements on top of the line-by-line CSV defusing in #119:

1. Tighten LEADING_WS so it only strips space-like characters that
   spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ,
   ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula
   leads, so we no longer strip them away before the DANGEROUS_LEAD
   check — otherwise "\t=cmd" or "\r=cmd" would slip through as
   non-dangerous after the strip.

2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR
   inside a multi-line cell is also a line boundary; the line after
   the CR is independently defused. Excel re-parses every physical
   line of a quoted cell, including lines separated by lone CR.

Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM
+ "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula
("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left
alone. All 44 cases in TableRenderer.test.ts pass.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc
sroussey added a commit that referenced this pull request Jun 4, 2026
…SP + download leak (sec) (#121)

* fix(cli): complete CSV escape hardening with unicode + bare-CR support

Layers two refinements on top of the line-by-line CSV defusing in #119:

1. Tighten LEADING_WS so it only strips space-like characters that
   spreadsheets silently ignore (ASCII space, NBSP, SHY, ZWSP, ZWNJ,
   ZWJ, LRM, RLM, BOM). \t and \r are themselves dangerous formula
   leads, so we no longer strip them away before the DANGEROUS_LEAD
   check — otherwise "\t=cmd" or "\r=cmd" would slip through as
   non-dangerous after the strip.

2. Split on /(\r\n|\r|\n)/ instead of /(\r?\n)/ so that a bare CR
   inside a multi-line cell is also a line boundary; the line after
   the CR is independently defused. Excel re-parses every physical
   line of a quoted cell, including lines separated by lone CR.

Tests cover the zero-width-prefix bypasses (ZWSP/ZWNJ/ZWJ/LRM/RLM/SHY/BOM
+ "=cmd"), mixed-WS bypasses ("ZWSP space =cmd"), bare-CR-followed-by-formula
("safe\r=cmd"), and a negative control to prove ZWSP-then-benign is left
alone. All 44 cases in TableRenderer.test.ts pass.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc

* fix(storage,task): close observation TOCTOU + bootstrap zip leak

Three related correctness fixes:

1. PersonObservationRepo / CompanyObservationRepo upsertByNaturalKey
   had a TOCTOU window — two concurrent upserts on an empty in-memory
   store would both observe `size()` as 0 and assign observation_id=1
   to two different natural keys. Introduce a tiny process-wide
   AsyncMutex in src/util/AsyncMutex.ts that serializes the INSERT
   branch only (the existing-row update branch is keyed by
   observation_id and remains lock-free). Inside the critical section
   we re-query the natural key before allocating an id so that a
   parallel upsert of the same natural key still collapses to one row
   and one id.

2. BootstrapDownloadTask staged the multi-GB zip into
   SEC_RAW_DATA_FOLDER and only removed it on the success branch. If
   Bun.spawn threw, or unzip exited non-zero, the archive leaked until
   the next bootstrap run, silently consuming disk on operator VMs.
   Wrap the spawn+exit-code block in try/finally and remove the zip
   (with `force: true`) on every exit path.

3. AsyncMutex itself: trivial queue-based serializer. Rejections in
   one critical section don't poison the queue because the chained
   tracker swallows both branches; each caller still observes its own
   rejection through the returned promise.

Tests:
- PersonObservationRepo.test.ts / CompanyObservationRepo.test.ts each
  gain two new cases: concurrent INSERT-on-empty must yield distinct
  sequential ids, and concurrent UPDATE-existing + parallel INSERT
  must keep the original id stable and give the insert the next id.
- BootstrapDownloadTask.test.ts gains an "execute zip cleanup"
  describe block that stubs fetch + Bun.spawn + Bun.which and asserts
  the zip is removed on (a) spawn-throws, (b) unzip-exits-nonzero,
  and (c) the success path.

All 24 affected tests pass.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc

* fix(forms): stop fabricating 0 for empty exempt-offering numerics (1/5)

Plan H part 1 of 5 — split across multiple commits due to the
push-files tool size limit; logically one fix. See later commits for
Form_1_A tests, Form_1_K, Form_1_Z, and Form_C.

Adds the shared _valueHelpers module (numScalar/strScalar/numWrapped/
strWrapped) plus its unit tests. numScalar treats empty / whitespace-
only / NaN / Infinity input as null and rejects thousand-separator
strings; legitimate "0" round-trips so the regression guard holds.

NOTE: this is an inline copy of the helpers introduced by PR #118
under src/sec/forms/insider-trading/_valueHelpers.ts. When #118 merges
the duplicate should be removed in favour of a single shared module.

Also switches Form_1_A.schema.ts decimal-type aliases from
Type.Number() to Type.String() (4 aliases) so the storage layer can
make the null-vs-zero decision per cell with numScalar() instead of
Value.Convert silently producing 0 for empty text. Form_1_A.storage.ts
is updated to numScalar() every decimal field in processFinancialData
and the RegAOfferingHistory build, and the extractor_version is bumped
1.0.0 → 1.1.0 to force re-extraction.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc

* fix(forms): stop fabricating 0 for empty exempt-offering numerics (2/5)

Plan H part 2 of 5 — Form_1_A test updates.

Form_1_A.test.ts: decimal-typed leaves now arrive as raw strings from
the parser; assertions updated accordingly (pricePerSecurity, audit/
legal/etc. fees). New "validate financial data types and ranges" split
into decimal-string vs integer-number checks.

Form_1_A.storage.test.ts: three new regression tests pin the null /
empty / "0" round-trip behaviour through processForm1A.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc

* fix(forms): stop fabricating 0 for empty exempt-offering numerics (3/5)

Plan H part 3 of 5 — Form_1_K.

Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to
Type.String().

Storage (processOfferingHistory): price_per_security,
aggregate_offering_price, aggregate_offering_price_holders, and
estimated_net_amount now flow through numScalar(). extractor_version
bumped 1.0.0 → 1.1.0 with the standard re-extract comment.

Tests: parsing assertions updated for the new string contract; new
storage regression tests for null/empty/"0" round-trip.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc

* fix(forms): stop fabricating 0 for empty exempt-offering numerics (4/5)

Plan H part 4 of 5 — Form_1_Z.

Schema: 2 DECIMAL_TYPE aliases switch from Type.Number() to
Type.String().

Storage: processOfferingSummaries now flows price_per_security,
portionSecuritiesSoldIssuer/Securityholders, and issuerNetProceeds
through numScalar(). processCertificationSuspension's
approxRecordHolders also goes through numScalar() now (with a
null-check instead of an undefined-check so the row is dropped
when the input is empty/whitespace). extractor_version 1.0.0 →
1.1.0.

Tests: parsing assertions updated for string contract; new storage
regression tests for null/empty/"0" round-trip.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc

* fix(forms): stop fabricating 0 for empty exempt-offering numerics (5/5)

Plan H part 5 of 5 — Form_C and final wrap-up.

Schema: 4 DECIMAL_TYPE aliases switch from Type.Number() to
Type.String() — including the previously minimum-0 DECIMAL_TYPE7_2_
NONNEGATIVE. The minimum-0 invariant is no longer expressible at the
schema level; that constraint becomes the extractor's responsibility
(see Form_C.storage.ts numScalar() use).

Storage:
- processOfferingInfo now flows price, offering_amount, and
  maximum_offering_amount through numScalar(); priceDeterminationMethod
  remains a passthrough.
- processAnnualReportDisclosures now flows every disclosure field
  through numScalar() and drops the row when the result is null,
  instead of writing a fabricated 0 disclosure_value.
- extractor_version 1.0.0 → 1.1.0.

Tests: parsing assertions updated for the new string contract on
price/offeringAmount/maximumOfferingAmount/currentEmployees/
totalAssetMostRecentFiscalYear/revenueMostRecentFiscalYear. New
storage regression tests cover null/empty/"0" round-trip for both
offering and disclosure numerics.

Whole-suite verification (bun test) passes 845/845 with the helpers,
schemas, storage, and tests all in place.

https://claude.ai/code/session_01Wws8oZpB5imjKL2e7DRXtc

* fix(cli): broaden CSV LEADING_WS to cover all spreadsheet-stripped invisibles

The LEADING_WS character class used to strip a handful of zero-width
characters before the formula-injection check, but missed common
spreadsheet-stripped invisibles (U+00AD SHY, U+034F CGJ, U+061C ALM,
U+115F/U+1160 Hangul fillers, U+1680 Ogham space, U+180E MVS, the bidi
formatting block U+202A..U+202E, U+205F MMSP, the invisible operators
U+2061..U+2064 / U+206A..U+206F, U+3164 Hangul filler, U+FFA0 halfwidth
Hangul filler). Each of those can prefix a `=`/`+`/`-`/`@` cell that
Excel/Sheets/Numbers strip before formula parsing.

The previous source also kept raw invisible characters inline, which
editors silently normalise. The regex is now built from `\uXXXX` escape
sequences so the codepoint set is reviewable.

The pre-existing test "defuses leading U+00A0 NBSP before '='" passed
ASCII SPACE to escapeCsvValue() — vacuous because ASCII space was
already stripped. The test now uses the U+00A0 escape and we add a
table-driven regression for every codepoint LEADING_WS covers, plus
NBSP-prefixed variants of every DANGEROUS_LEAD codepoint.

https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u

* fix(task): clean up partial bulk-download on stream abort or size mismatch

streamDownloadToFile previously left a half-written archive on disk if
the fetch body errored mid-stream. The next bootstrap run would then
hand that partial multi-GB ZIP straight to unzip and either silently
produce a corrupt extract or fail in a way that masked the real cause.

Three fixes:
- await writer.write(value) so back-pressure is honoured. The previous
  code fired-and-forgot, which let bytes pile up in the writer queue.
- On any error path (network abort, writer failure, size mismatch),
  rmSync(destPath, { force: true }) before rethrowing. Best-effort —
  the original error is the one callers see.
- Validate `bytes === content-length` when the server advertised one.
  A short response is now a hard failure instead of a truncated archive.

The reader lock is released before the writer is closed so any pending
flush doesn't deadlock against a still-locked stream.

Tests cover the mid-stream abort and content-length mismatch paths,
asserting both the throw AND that the destination file is gone.

https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u

* fix(observation): delegate observation_id assignment to the storage backend

The old INSERT path computed `(await repo.size()) + 1` for the new row's
observation_id, guarded by a process-wide AsyncMutex to close the
size()/put() TOCTOU window. That was wrong in two ways:

  1. The mutex was only effective in a single process. Two `sec`
     processes — or two workers — sharing the same SQLite/Postgres
     backend would still collide.
  2. `repo.size()` reads the live row count, so a delete elsewhere in
     the table would silently reassign an id to a new natural key.

Replace the whole dance with `x-auto-generated: true` on the
observation_id schema field. The storage backend assigns the id (SQLite
INTEGER PRIMARY KEY AUTOINCREMENT, Postgres SERIAL, in-memory counter)
and returns the persisted row. `put()`'s return value gives us back the
assigned id without a second query. The repo-side AsyncMutex import is
dropped from both observation repos.

Tests previously asserted specific id values (`expect(...).toBe(1)`).
Auto-generated keys do not contract for "first id is 1" across backends
(Postgres SERIAL can skip on rollback), so the assertions are now
`toBeGreaterThan(0)` + `.not.toBe(other)`. EntityObserver tests get the
same treatment.

A smoke-test against InMemoryTabularStorage confirmed the backend does
return the auto-generated id from put() — the natural-key PK fallback
documented in the plan was not needed.

https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u

* fix(resolver): serialize person/company find-or-create per natural key

Two parallel `resolver.resolve(obs)` calls that mapped to the same
(resolver_version, key_kind, key_value) used to race: both observed no
canonical row, both minted a fresh UUID via randomUUID(), both inserted,
yielding two distinct canonical ids for what should have been the same
entity. Every downstream identity-link that pointed at the duplicate
canonical was then silently wrong.

PersonResolver and CompanyResolver now hold a process-wide
Map<string, { mutex: AsyncMutex; refs: number }> keyed on the
resolver_version-plus-natural-key tuple. The find-or-create critical
section runs inside `mutex.lock()` so a queued caller observes the
canonical row the previous caller just inserted, and returns the same
id. The refcount is decremented in a finally block and the entry is
deleted from the map when it hits zero, so the map stays bounded for
processes that resolve millions of distinct keys.

Multi-process callers (separate `sec` invocations, workers sharing a
backend) still need a backend-level UNIQUE constraint to be race-free —
a single-process mutex isn't visible across processes. That is noted
in the class JSDoc.

New PersonResolver.race.test.ts and CompanyResolver.race.test.ts cover:
- two parallel resolves on the same CIK collapsing to one canonical
- ditto for the name-key path (Person) and CRD + name paths (Company)
- 20-25 fan-out stress for queue-depth coverage
- distinct-key parallel resolves do NOT block each other

https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u

* fix(pr-review): address 5 Copilot review findings on PR #121

* AsyncMutex: rewrite the file docblock so it documents the current
  per-key resolver find-or-create use-case instead of the retired
  observation-repo TOCTOU window (observation_id is now auto-generated).
* BootstrapDownloadTask.streamDownloadToFile: close the writer before
  unlinking so Windows can actually delete the partial file, and
  surface writer.end() failures as the operation error on the success
  path instead of silently returning success over a half-flushed file.
* BootstrapDownloadTask.test: snapshot+restore the previous
  SEC_RAW_DATA_FOLDER binding around setupRawDataFolder() so the
  globalServiceRegistry no longer leaks a stale tmpdir into later test
  files.
* Form_1_Z.test: replace the vacuous `expect(cik).toBe(cik)` self-
  comparison with assertions that the value is a 1-10 digit string,
  matching the CIK_TYPE schema.
* Form_C.storage: restore the schema-level minimum:0 invariant on
  `currentEmployees` that was lost when the field moved to string-typed
  decimals + numScalar(). Negative values now drop to null instead of
  persisting as a negative disclosure_value. Adds a regression test.

* fix(cli): strip U+2028/U+2029 before formula-lead check

JS `\n` does not match U+2028 LINE SEPARATOR or U+2029 PARAGRAPH
SEPARATOR, so escapeCsvValue never split them out as line breaks.
Some spreadsheet importers silently treat them as leading whitespace
before formula parsing, leaving a "<LS>=cmd" payload unguarded.
Add both codepoints to LEADING_WS and pin them with regression tests.

https://claude.ai/code/session_011KMd9sERp2rguyekAi8a3u

---------

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant