Removed deep-copy data.table ops from the dataProcess pipeline by tonywu1999 · Pull Request #208 · Vitek-Lab/MSstats

tonywu1999 · 2026-05-14T15:24:06Z

User description

Replaced input[, cols, with = FALSE] deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
Replaced row-shuffle input = input[order(...), ] in .prepareForDataProcess with data.table::setorder() (in place).
Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table.
Replaced the synthesised tmp string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept.
Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

I have read the MSstats contributing guidelines
My changes generate no new warnings
Any dependent changes have been merged and published in downstream modules
I have run the devtools::document() command after my changes and committed the added files

PR Type

Enhancement, Bug fix, Tests

Description

Replace full-table copies with in-place updates
Use keyed lookups for joins
Split survival outputs before finalization
Add memory-regression pipeline tests

Diagram Walkthrough

flowchart LR
  a["Copy-heavy data.table operations"]
  b["In-place set and setorder updates"]
  c["Keyed survival and fraction lookups"]
  d["Lower-memory summarization pipeline"]
  a -- "replaced by" --> b
  b -- "combined with" --> c
  c -- "reduces copies in" --> d

File Walkthrough

Relevant files

Enhancement

dataProcess.R `Limit censored-value updates to matching rows` R/dataProcess.R Replace `ifelse()` rewrites with targeted `:=` updates Only keep `predicted` on applicable censored rows Only overwrite `newABUNDANCE` where imputation applies	+12/-10
utils_checks.R `Avoid copies during input trimming and sorting` R/utils_checks.R Drop unwanted columns via `data.table::set(..., NULL)` Avoid retained-column deep copies during preparation Sort rows in place with `data.table::setorder()`	+13/-5
utils_feature_selection.R `Collapse feature preprocessing into one pass` R/utils_feature_selection.R Build the feature-selection working table once Handle optional `censored` values inline Compute `is_obs` without intermediate tables	+20/-22
utils_normalize.R `Use in-place cleanup and keyed fraction merges` R/utils_normalize.R Remove normalization temp columns in place Return the modified input after cleanup Replace `merge()` with keyed `newRun` assignment Filter fractions by direct `(FEATURE, FRACTION)` lookup	+23/-13

Bug fix

utils_output.R `Streamline summary output and imputation joins` R/utils_output.R Extract combined `predicted_survival` before finalization Free nested survival entries before binding summaries Replace merge-based imputation with keyed writes Guard join columns and update flags in place	+51/-27

Tests

test_memory_optimization_copies.R `Add memory regression tests for copy avoidance` inst/tinytest/test_memory_optimization_copies.R Add address-based assertions for in-place behavior Test `.normalizeMedian` temp-column cleanup Test `.finalizeTMP` keyed matches and unmatched `NA`s Verify `MSstatsSummarizationOutput` list splitting behavior	+469/-0

Motivation and Context — Short summary of the solution

The dataProcess pipeline used several copy-heavy data.table idioms (column-subset deep copies, merge(all.x=TRUE) joins, order-based reassignments, and full-vector ifelse writes) that caused excessive memory allocations. This PR replaces those with in-place data.table operations (data.table::set(..., value = NULL), data.table::setorder(), keyed which lookups + data.table::set, and targeted [i, j := v] updates) to reduce memory churn while preserving behavior. It also extracts predicted_survival earlier in summarization, nulls nested survival slots, fixes two regressions (predicted_survival join-columns intersection and retention of LABEL), updates documentation for changed parameters, and adds memory-regression tinytests. Full test suite passes.

Detailed changes

General
- Replace copy-producing column/row subset and merge patterns with in-place updates: data.table::set(j=..., value=NULL), data.table::setorder(), keyed which lookups + data.table::set, and targeted [i, j := v] conditional assignments.
- Extract predicted_survival out of nested summarization early and null survival entries before finalization; final feature/protein outputs converted to data.frame via data.table::setDF().
R/dataProcess.R
- MSstatsSummarizeSingleLinear / MSstatsSummarizeSingleTMP: replace bulk ifelse(...) column rewrites with conditional data.table subsetting + := assignments; set predicted = NA for non-censored labeled-reference rows and write fitted predicted only for censored subset; survival table built after in-place updates.
R/utils_checks.R
- .checkUnProcessedDataValidity: compute canonical column set early, coerce/add defaults using data.table::set, and drop extraneous columns with set(..., value=NULL) instead of later column-subsetting.
- .prepareForDataProcess: use data.table::set for derived columns (PEPTIDE, TRANSITION) and isotope mapping.
- .makeFactorColumns: replace copy-based input[order(...), ] with in-place data.table::setorder() using the same multi-column ordering.
R/utils_feature_selection.R
- .selectHighQualityFeatures: collapse subset+transform into a single pass; detect optional censored column, compute log2inty from ABUNDANCE (treating censored/NA as missing), derive is_obs from log2inty non-NA; remove intermediate censored/is_censored renames and avoid intermediate copy.
R/utils_normalize.R
- .normalizeMedian and .normalizeGlobalStandards: remove helper median/normalization columns in-place with set(..., value=NULL) rather than rebuilding input via column exclusion.
- MSstatsMergeFractions (non-TECHREPLICATE path): derive merged-run lookup from dcast, assign newRun to rows via keyed which lookup + data.table::set, filter out all-zero feature/fraction combos via keyed lookup, update originalRUN, refactor RUN levels, and clear helper newRun in-place; removed merge(all.x=TRUE)+tmp filtering and string-join filtering.
R/utils_output.R
- MSstatsSummarizationOutput: pre-extract predicted_survival (element [[2]]), finalize input using predicted_survival, rebuild summarized from protein-level only (element [[1]]) and null survival components; drop non-output columns in-place and convert outputs to data.frame with setDF().
- .finalizeInput / .finalizeTMP: accept predicted_survival instead of nested summarized list; TMP imputation applies predicted values via indexed keyed lookup and targeted in-place writes to newABUNDANCE and predicted; update nonmissing_orig via conditional :=.
Regressions fixed
- Ensure .finalizeTMP join_cols intersect predicted_survival columns to avoid losing LABEL.
- Preserve LABEL selection required downstream.
Documentation
- man/dot-finalizeInput.Rd and man/dot-finalizeTMP.Rd updated to reflect new parameter predicted_survival and reordered usage where applicable.

Unit tests added / modified

inst/tinytest/test_memory_optimization_copies.R (new; ~28 assertions, ~441 lines)
- Asserts in-place mutation via data.table::address() for .normalizeMedian and .finalizeTMP on small and scaled fixtures.
- Verifies temporary helper columns (ABUNDANCE_RUN, ABUNDANCE_FRACTION, newRun) are removed and ABUNDANCE remains.
- Validates .finalizeTMP join/update semantics: matched (cen, RUN, FEATURE) update newABUNDANCE from predicted_survival; unmatched keys produce NA; predicted column added.
- Scale tests (~50K rows) check in-place behavior, required output columns, and NA semantics for excluded join-key combinations.
- Confirms .finalizeTMP accepts combined predicted_survival table (not nested) and MSstatsSummarizationOutput decomposes nested summarized list correctly; combined predicted_survival is smaller than nested list.
inst/tinytest/test_MSstatsMergeFractions.R (new)
- Tests fraction-merging: single-fraction preservation, ABUNDANCE clamping before merging, non-TECHREPLICATE collapse semantics, dropping unobserved feature/fraction combos, RUN factor handling, and absence of helper newRun.
Test results reported: full test suite passes (224/224).

Coding guidelines / potential violations

In-place mutation / caller-side mutation risk
- Several changes perform in-place updates on data.tables passed by reference. This reduces copies but can unexpectedly mutate caller-provided tables. The concern raised in review (using := inside .selectHighQualityFeatures) highlights a real risk: mutating the input may break downstream callers that expect non-mutated inputs or rely on a previous version of the table. Functions that previously returned modified copies may now silently alter caller state.
API contract/documentation drift
- Internal signatures changed (.finalizeInput / .finalizeTMP now accept predicted_survival). Although documented, external callers (or reflective code) depending on prior signature could be affected; ensure internal-only status or coordinate broader updates.
Test coverage vs. behavioral guarantees
- Tests assert address preservation and in-place effects; however, the test suite focuses on current callers and fixtures. Any public-facing functions that relied on non-mutating behavior should be audited and documented to avoid breaking user code.
Suggestion (not implemented): where non-mutation is required by contract, explicitly copy (setDT(copy(...)) or copy() at function entry) or document mutability in function docs to make side-effects explicit.

* Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL). * Replaced row-shuffle `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place). * Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table. * Replaced the synthesised `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call. * Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes. * Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy. * Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame(). * Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept. * Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK. See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md Co-Authored-By: Claude <noreply@anthropic.com>

coderabbitai · 2026-05-14T15:24:13Z

📝 Walkthrough

Walkthrough

Refactor to perform in-place data.table updates, extract predicted_survival from summarization outputs and pass it into finalizers, replace merges/ifelse with indexed := assignments, and add tinytests validating in-place and merge behaviors.

Changes

DataProcess Pipeline Memory Optimization and Output Refactoring

Layer / File(s)	Summary
Data validation and preprocessing cleanup `R/utils_checks.R`	Compute allowed columns early, remove unexpected columns in-place, insert/coerce required columns with `data.table::set`, and use `data.table::setorder()` before factorization.
Feature selection and censored handling `R/utils_feature_selection.R`	Simplify preprocessing by treating optional `censored` as missingness and deriving `is_obs` directly from `ABUNDANCE`/log2 intensity.
Normalization cleanup and fraction-merge rewrite `R/utils_normalize.R`	Remove temporary median/normalization helpers with `data.table::set(..., value=NULL)` and rewrite `MSstatsMergeFractions` non-TECHREPLICATE branch to use dcast-derived lookup, join-index assignment, and in-place cleanup instead of merge/tmp columns.
Output assembly and predicted_survival plumbing `R/utils_output.R`	Extract `predicted_survival` from `summarized`, pass it into `.finalizeInput`/`.finalizeTMP`, apply imputations via indexed data.table lookups and targeted `:=` assignments, drop non-output columns in-place, and convert outputs to data.frames with `setDF`.
Censored-value imputation post-processing `R/dataProcess.R`	Replace `ifelse(...)` column reassignments with conditional `data.table` subset `:=` updates in `MSstatsSummarizeSingleLinear` and `MSstatsSummarizeSingleTMP`, with explicit handling for `is_labeled_reference` and conditional `nonmissing_orig` updates.
Memory-optimization and merge tests `inst/tinytest/*`	Add tests asserting in-place `data.table` behavior (addresses preserved), temporary columns removed, correct imputation for matched/unmatched keys, a ~50K-row stress test for `.finalizeTMP`, contract checks for `predicted_survival` shape/size, and `MSstatsMergeFractions` behavior tests.
Documentation updates `man/dot-finalizeInput.Rd`, `man/dot-finalizeTMP.Rd`	Help topics updated to reflect new `predicted_survival` parameter and reordered arguments.

Sequence Diagrams

No sequence diagrams generated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Vitek-Lab/MSstats#192: Related edits around labeled/reference handling passed into summarization (ref/is_labeled_reference) that touch similar code paths.
Vitek-Lab/MSstats#207: Overlapping changes to per-protein summarizers and survival/fit object handling in dataProcess.
Vitek-Lab/MSstats#193: Related modifications to censored post-imputation handling and label-aware gating in R/dataProcess.R.

Poem

🐇 I nudged the columns, light and sly,
In-place I whisper, no copies fly.
Survival split, then stitched again,
Tests check addresses, tidy and plain.
A rabbit's hop — efficient, spry.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The description includes detailed bullet points explaining changes and testing, but the 'Motivation and Context' section under the template remains unfilled with only placeholder content, and the 'Changes' section bullets are empty.	Fill in the 'Motivation and Context' section with the problem summary and solution approach, and complete the 'Changes' bullet points with the specific modifications made.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main objective of the changeset: removing deep-copy operations from data.table operations in the dataProcess pipeline to improve memory efficiency.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch MSstats/work/20260514_avoid-deep-copy-ops

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-14T15:28:41Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Input mutation `MSstatsPrepareForDataProcess` now removes columns from `input` in place and then adds derived columns on the same object. If the caller passes a `data.table` they intend to reuse, this function will now silently delete all non-selected columns from that original object, which did not happen before when the code first created a subset copy. drop_cols = setdiff(colnames(input), cols) for (col in drop_cols) data.table::set(input, j = col, value = NULL) input$PEPTIDE = paste(input$PEPTIDESEQUENCE, input$PRECURSORCHARGE, sep = "_") input$TRANSITION = paste(input$FRAGMENTION, Output mutation `MSstatsSummarizationOutput` now mutates its `input` argument by reference by dropping columns and converting it with `setDF()`. Because `input` is a `data.table`, the caller's object will be stripped down to output columns and lose its `data.table` class after this call, which can break any downstream code that reuses the same table. drop_cols = setdiff(colnames(input), output_cols) for (col in drop_cols) data.table::set(input, j = col, value = NULL) if (is.element("remove", colnames(processed))) { processed = processed[(remove), intersect(output_cols, colnames(processed)), with = FALSE] input = rbind(input, processed, fill = TRUE) } data.table::setDF(input) data.table::setDF(rqall) list(FeatureLevelData = input,

github-actions · 2026-05-14T15:33:19Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Avoid mutating caller objects `data.table::setDF()` mutates by reference, so this function now converts the caller's `input` object into a plain `data.frame`. That is a behavioral regression from `as.data.frame(...)` and can break later code that still expects `input` to be a `data.table`. R/utils_output.R [101-105] -data.table::setDF(input) -data.table::setDF(rqall) -list(FeatureLevelData = input, - ProteinLevelData = rqall, +feature_output = data.table::copy(input) +protein_output = data.table::copy(rqall) +data.table::setDF(feature_output) +data.table::setDF(protein_output) +list(FeatureLevelData = feature_output, + ProteinLevelData = protein_output, SummaryMethod = method) Suggestion importance[1-10]: 7 __ Why: This is correct and important for API behavior: `data.table::setDF()` mutates `input` and `rqall` by reference, unlike the previous `as.data.frame(...)` conversion. Copying before `setDF()` preserves the optimization while avoiding surprising side effects on caller-visible `data.table` objects.	Medium
Possible issue	Guard nested summary extraction Validate `summarized` before indexing `x[[2]]`. The new code dereferences every element immediately, so any `try-error` or short result now crashes here and bypasses the function's existing failure handling. R/utils_output.R [41-44] -predicted_survival = data.table::rbindlist(lapply(summarized, function(x) x[[2]]), - fill = TRUE) +invalid_summary = vapply( + summarized, + function(x) inherits(x, "try-error") \|\| length(x) < 2L, + logical(1) +) +if (any(invalid_summary)) { + stop("`summarized` contains failed or incomplete protein summaries.") +} +predicted_survival = data.table::rbindlist(lapply(summarized, `[[`, 2L), + fill = TRUE) for (i in seq_along(summarized)) summarized[[i]][[2]] = NULL input = .finalizeInput(input, predicted_survival, method, impute, censored_symbol) Suggestion importance[1-10]: 6 __ Why: This is a valid robustness concern: `MSstatsSummarizationOutput()` now dereferences `x[[2]]` before the later `inherits(summarized, "try-error")` check, so malformed entries in `summarized` can fail earlier with an uncontrolled error. The fix is relevant, but it mainly improves error handling rather than changing core results.	Low

mstaniak · 2026-05-15T10:14:47Z

Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?

mstaniak

Hi,
thanks again for this update. I have a few minor comments

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

inst/tinytest/test_memory_optimization_copies.R (1)
328-369: ⚡ Quick win

Add a mixed-LABEL fixture to this contract test.

These assertions only exercise LABEL = "L", so a regression that drops LABEL from the survival projection or join keys would still pass here. A small L/H fixture with duplicated (RUN, FEATURE) values would cover the regression this stack is guarding against.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inst/tinytest/test_memory_optimization_copies.R` around lines 328 - 369, The
test only uses LABEL = "L", so add mixed LABEL values and duplicated (RUN,
FEATURE) combos to both finalize_input_4 and pred_surv_4 to exercise join keys:
modify finalize_input_4$LABEL to contain a small mixture (e.g. "L" and "H" as a
factor) with duplicated RUN/FEATURE pairs across labels, and add a LABEL column
to pred_surv_4 with matching L/H entries (and duplicate RUN/FEATURE rows) so
MSstats:::.finalizeTMP must preserve/join on LABEL; keep result_4 assertions but
ensure the fixture includes those mixed-label cases to catch regressions that
drop LABEL from survival projection or join keys.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 206-212: The test currently counts all non-NA newABUNDANCE values
(matched_count) which can include rows that were already populated; instead
capture the rows that started with newABUNDANCE = NA before calling
.finalizeTMP() (e.g. store original_na_idx <-
is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.

In `@R/utils_output.R`:
- Around line 41-49: Check whether summarized contains a "try-error" result
before accessing x[[1]]/x[[2]]: if any element of summarized inherits from
"try-error" (the fallback path intended for failed
MSstatsSummarizeWithSingleCore()), do not rbind or unpack
predicted_survival/protein_summaries; instead invoke the existing fallback
behavior (the same path currently guarded at the later check) and avoid calling
.finalizeInput on invalid data. Update the block that builds predicted_survival
and protein_summaries to first detect try-error in summarized and branch to the
fallback handling when present, referencing the summarized variable and the
.finalizeInput call to ensure invalid summary results are not unpacked.
- Around line 101-102: The calls to data.table::setDF(input) and
data.table::setDF(rqall) mutate caller-owned objects in place; update
MSstatsSummarizationOutput to avoid by-reference mutation by operating on copies
instead (e.g., create local copies like input_copy <- input and rqall_copy <-
rqall or coerce with as.data.frame() on copies) and call data.table::setDF() (or
as.data.frame) on those copies so the original input and rqall keep their
data.table class; ensure all subsequent uses in the function reference the
copied variables (input_copy, rqall_copy) rather than the originals.

---

Nitpick comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 328-369: The test only uses LABEL = "L", so add mixed LABEL values
and duplicated (RUN, FEATURE) combos to both finalize_input_4 and pred_surv_4 to
exercise join keys: modify finalize_input_4$LABEL to contain a small mixture
(e.g. "L" and "H" as a factor) with duplicated RUN/FEATURE pairs across labels,
and add a LABEL column to pred_surv_4 with matching L/H entries (and duplicate
RUN/FEATURE rows) so MSstats:::.finalizeTMP must preserve/join on LABEL; keep
result_4 assertions but ensure the fixture includes those mixed-label cases to
catch regressions that drop LABEL from survival projection or join keys.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2610ce77-e553-465d-8347-26a794f1b249

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and 0764b34.

📒 Files selected for processing (6)

R/dataProcess.R
R/utils_checks.R
R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R
inst/tinytest/test_memory_optimization_copies.R

coderabbitai · 2026-06-02T05:51:14Z

+# (b) Matched rows should have newABUNDANCE from predicted_survival.
+# Any row whose (cen, RUN, FEATURE) key exists in predicted_survival
+# should get a non-NA newABUNDANCE value.
+matched_count = sum(!is.na(result$newABUNDANCE))
+expect_true(matched_count > 0,
+            info = paste("Matched rows should have non-NA newABUNDANCE.",
+                         "Found", matched_count, "non-NA values"))


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert on the imputed rows directly.

matched_count includes rows that were already non-NA before .finalizeTMP(), so this still passes if the join never writes the censored rows. Check exact expected values on the rows that start with newABUNDANCE = NA instead.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inst/tinytest/test_memory_optimization_copies.R` around lines 206 - 212, The test currently counts all non-NA newABUNDANCE values (matched_count) which can include rows that were already populated; instead capture the rows that started with newABUNDANCE = NA before calling .finalizeTMP() (e.g. store original_na_idx <- is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to compare), replacing the generic expect_true(matched_count > 0) with direct checks on those indices.

On a second thought - let's check the total count, not the minimum here.

(as in: comparing to 0 checks if anything is there, lets compare to the total expected number of non-NA rows, aka "==" check)

@mstaniak — makes sense. Based on the fixture, 8 rows start non-NA and predicted_survival covers 3 of the 4 originally-NA rows (row 11's (RUN=2, FEATURE=feat2) for P2 is intentionally excluded), so the expected total is 11:

# (b) After imputation, non-NA count should equal the number of rows # covered by predicted_survival plus the rows that were already non-NA. # Row 11 (P2 / feat2 / RUN=2) is intentionally excluded, so it stays NA. expect_equal(matched_count, 11L, info = paste("Expected 11 non-NA newABUNDANCE values after imputation.", "Found", matched_count))

This is a strict equality check and will catch both "nothing got imputed" and "too many rows were filled" regressions.

🧠 Learnings used

Learnt from: tonywu1999 Repo: Vitek-Lab/MSstats PR: 192 File: R/utils_censored.R:34-35 Timestamp: 2026-04-13T14:51:35.558Z Learning: In MSstats (R package, `R/utils_censored.R`), the `use_for_analysis` fallback in `MSstatsHandleMissing` intentionally uses `rep(TRUE, nrow(input))` when the `ref` column is absent. SRM experiments with heavy labels are required to use `EQUALIZEMEDIANS` normalization, which is the only path that adds the `ref` column. Other normalization methods (QUANTILE, GLOBALSTANDARDS, FALSE/NONE) are not appropriate for labeled experiments with heavy references, so when `ref` is missing the data is label-free and applying censoring to all rows is correct. A fallback to `input$LABEL == "L"` is unnecessary.

@mstaniak, got it — the total expected non-NA count after imputation is 11 (8 rows that were already non-NA + 3 censored rows that have a match in predicted_survival; only row 11 — P2/feat2/RUN=2 — stays NA because it was deliberately excluded).

Replace the assertion:

# (b) All rows that can be imputed should have non-NA newABUNDANCE. # 8 rows were already non-NA; of the 4 censored rows (2, 6, 8, 11), # rows 2/6/8 match in predicted_survival and row 11 (P2/feat2/RUN=2) was # deliberately excluded → 11 total non-NA values expected. expected_non_na = 11L matched_count = sum(!is.na(result$newABUNDANCE)) expect_equal(matched_count, expected_non_na, info = paste("Expected", expected_non_na, "non-NA newABUNDANCE values after imputation.", "Got:", matched_count))

This makes the check exact: it fails if even one eligible censored row was not filled in, rather than just verifying "at least something happened."

🧠 Learnings used

Learnt from: tonywu1999 Repo: Vitek-Lab/MSstats PR: 192 File: R/utils_censored.R:34-35 Timestamp: 2026-04-13T14:51:35.558Z Learning: In MSstats (R package, `R/utils_censored.R`), the `use_for_analysis` fallback in `MSstatsHandleMissing` intentionally uses `rep(TRUE, nrow(input))` when the `ref` column is absent. SRM experiments with heavy labels are required to use `EQUALIZEMEDIANS` normalization, which is the only path that adds the `ref` column. Other normalization methods (QUANTILE, GLOBALSTANDARDS, FALSE/NONE) are not appropriate for labeled experiments with heavy references, so when `ref` is missing the data is label-free and applying censoring to all rows is correct. A fallback to `input$LABEL == "L"` is unnecessary.

Good point - switched it to an exact == count check. It now computes the number of (cen, RUN, FEATURE) keys actually present in predicted_survival and asserts that newABUNDANCE has exactly that many non-NA values:

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

tonywu1999 · 2026-06-05T01:14:06Z

-    input[, is_obs := !(is.na(log2inty) | is_censored)]
-    input[, is_censored := NULL]
-
+                         is_obs = FALSE)]


What's the point of setting is_obs = FALSE if is_obs is set again in line 90?

tonywu1999 · 2026-06-05T01:20:15Z

-                         log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored),
+                         log2inty = ifelse(!(is.na(ABUNDANCE) |
+                                            if (has_censored) censored else FALSE),
                                           ABUNDANCE, NA),


This statement is a little confusing. Can it be simplified? The if statement in the middle of this makes it confusing

ifelse(!is.na(ABUNDANCE) & !(has_censored & censored), ABUNDANCE, NA)

Also it should be converted to a series of ":=" I think

tonywu1999 · 2026-06-05T02:29:44Z

Why is this needed here?

The "merged" column existed only so it could be glued onto the end of the run name.
Now "merged" is visibly just a suffix in the name

tonywu1999 · 2026-06-05T03:08:13Z

                input$RUN = factor(input$RUN, levels=unique(input$RUN), labels=seq(1, length(unique(input$RUN))))
                input = input[, !(colnames(input) %in% c('tmp','newRun')), with = FALSE]
            }
        } else {


Could you add unit tests for the MSstatsMergeFractions function? It seems to be a really confusing part and it'd be helpful to have test cases that verify its original behavior remains unchanged.

One more note: this section of code took me a really long time to figure out what was going on. If possible, could you look to simplify the code so that it's obvious what this function's intent is? I'd timeblock at most 30 minutes for simplifying the code and making the original intent obvious, I don't think this code block is used often by users.

The merged := "merged" column only existed so the next line could append a "merged" suffix to the run name via positional indexing: .SDcols = c(1:3, ncol(match_runs)). The non-obvious part is what those positions actually select. With, say, 4 fractions, match_runs (after dcast) looks like:

pos: 1 2 3 4 5 6 7 GROUP_ORIGINAL SUBJECT_ORIGINAL "1" "2" "3" "4" merged

So c(1:3, ncol) = positions 1, 2, 3, 7 = GROUP_ORIGINAL, SUBJECT_ORIGINAL, fraction 1 only, and merged. Fractions 2–4 are never used — the merged run is named after just the first fraction's run, even though the code looks like it might use all of them.

Also, I focused on the non-TECHREPLICATE branch, which is the part this PR rewrote

tonywu1999 · 2026-06-05T03:19:15Z

+    summarized = data.table::rbindlist(protein_summaries, fill = TRUE)
+    rm(protein_summaries)
+
    if (inherits(summarized, "try-error")) {


how does summarized inherit try-error here if there's no try-catch block for line 48?

Yeah, I don't see a try-catch clause in the summarization code. Is there even a useful place for it there?
On a related note: we use withCallingHandlers for AFT call there, isn't it very slow?

fixed it with: if (is.null(summarized))

mstaniak · 2026-06-06T22:18:42Z

Hi,
thanks for the updates, I'll review these changes on Monday

mstaniak · 2026-06-08T07:52:17Z

        if (is_labeled_reference) {
-            single_protein[, predicted := ifelse(censored & is_labeled_ref == FALSE, predicted, NA)]
-            single_protein[, newABUNDANCE := ifelse(censored & is_labeled_ref == FALSE, predicted, newABUNDANCE)]
+            single_protein[!(censored & is_labeled_ref == FALSE), predicted := NA]
+            single_protein[(censored) & is_labeled_ref == FALSE,
+                           newABUNDANCE := predicted]
        } else {
-            single_protein[, predicted := ifelse(censored, predicted, NA)]
-            single_protein[, newABUNDANCE := ifelse(censored, predicted, newABUNDANCE)]
+            single_protein[!(censored), predicted := NA]
+            single_protein[(censored), newABUNDANCE := predicted]
        }
-
        survival = single_protein[, intersect(c(cols, "LABEL", "predicted"), colnames(single_protein)), with = FALSE]


Please

remove the "==FALSE" comparisons

check if the line 424 creates a copy that matters

Done — replaced all six is_labeled_ref == FALSE comparisons with !is_labeled_ref (in both MSstatsSummarizeSingleLinear and MSstatsSummarizeSingleTMP).

Checked with address() — single_protein[, keep, with = FALSE] does materialize a copy of the selected columns (they get new addresses, not shared). But it's a per-protein slice (a handful of columns × one protein's rows), not a whole-dataset copy, so the size is negligible relative to the copies this PR targets. It's also a necessary copy: survival is returned separately and, in the no-impute branch, gets survival[, predicted := NA] — sharing storage with single_protein would corrupt it. So I'd leave it as-is, but happy to revisit

mstaniak · 2026-06-08T07:53:19Z

+                           newABUNDANCE := predicted]
        }
        survival = single_protein[, intersect(c(cols, "LABEL", "predicted"), colnames(single_protein)), with = FALSE]
    } else {


save as above

let's remove the "==FALSE" comparisons

let's see if line 580 creates a copy (of a considerable size, if any)

mstaniak · 2026-06-08T07:53:59Z

+    drop_cols = setdiff(colnames(input), cols)
+    for (col in drop_cols) data.table::set(input, j = col, value = NULL)
+
+    input$PEPTIDE = paste(input$PEPTIDESEQUENCE,


"$" assignments should go. Line 215 is the way to go

mstaniak · 2026-06-08T07:54:35Z

-                         log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored),
+                         log2inty = ifelse(!(is.na(ABUNDANCE) |
+                                            if (has_censored) censored else FALSE),
                                           ABUNDANCE, NA),


Also it should be converted to a series of ":=" I think

mstaniak · 2026-06-08T08:01:41Z

+    summarized = data.table::rbindlist(protein_summaries, fill = TRUE)
+    rm(protein_summaries)
+
    if (inherits(summarized, "try-error")) {


Yeah, I don't see a try-catch clause in the summarization code. Is there even a useful place for it there?
On a related note: we use withCallingHandlers for AFT call there, isn't it very slow?

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@man/dot-finalizeInput.Rd`:
- Around line 7-18: The Rd documentation is out of sync: update the \arguments
block for .finalizeInput to replace the stale item {summarized} with an entry
for {predicted_survival} that describes the new data.table contract;
specifically, ensure the arguments list includes input, predicted_survival,
method, impute, and censored_symbol and that the predicted_survival description
matches the function's expected data.table structure (columns/types) used by
.finalizeInput so \usage and \arguments are consistent with the function
signature.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb8c80d9-00e4-4549-b689-e83302de9aac

📥 Commits

Reviewing files that changed from the base of the PR and between 0764b34 and 27d5f7c.

📒 Files selected for processing (6)

R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R
inst/tinytest/test_MSstatsMergeFractions.R
man/dot-finalizeInput.Rd
man/dot-finalizeTMP.Rd

✅ Files skipped from review due to trivial changes (1)

man/dot-finalizeTMP.Rd

🚧 Files skipped from review as they are similar to previous changes (3)

R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R

coderabbitai · 2026-06-09T06:53:30Z

+.finalizeInput(input, predicted_survival, method, impute, censored_symbol)
 }
 \arguments{
 \item{input}{feature-level data}

-\item{summarized}{protein-level data (list)}
-
 \item{method}{summary method}

 \item{impute}{if TRUE, censored missing values were imputed}

 \item{censored_symbol}{censored missing value indicator}
+
+\item{summarized}{protein-level data (list)}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Documented arguments are out of sync with the updated function signature.

Line 7 uses predicted_survival, but Line 18 still documents summarized and there is no predicted_survival argument entry. Please replace the stale summarized item with predicted_survival (and description matching the new data.table contract) so \usage and \arguments are consistent.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@man/dot-finalizeInput.Rd` around lines 7 - 18, The Rd documentation is out of sync: update the \arguments block for .finalizeInput to replace the stale item {summarized} with an entry for {predicted_survival} that describes the new data.table contract; specifically, ensure the arguments list includes input, predicted_survival, method, impute, and censored_symbol and that the predicted_survival description matches the function's expected data.table structure (columns/types) used by .finalizeInput so \usage and \arguments are consistent with the function signature.

Rudhik1904 · 2026-06-09T08:22:52Z

@mstaniak

For issue: #208 (comment)

I looked into this but think the current input[, list(...)] is actually the right call here, because of the following:

In-place := on input would corrupt the caller. .selectHighQualityFeatures is called from MSstatsSelectFeatures, which reuses the same input immediately afterwards:

features_quality = .selectHighQualityFeatures(input, min_feature_count)
input = merge(input, features_quality,
              by.x = c("LABEL", "PROTEIN", "FEATURE", "originalRUN"),
              by.y = c("label", "protein", "feature", "run"))

Since data.table :=/setnames modify by reference across function boundaries, renaming the columns to lowercase (or adding log2inty/is_obs) inside the function would mutate the caller's table and break that merge (its by.x keys would no longer exist). The input[, list(...)] form avoids this by returning a new table and only rebinding the local name.

It's already a minimal projection, not a full-table copy. list(...) materializes only the 7 columns the feature-selection step needs — it's not the kind of whole-table duplication this PR is targeting.

happy to reevaluate my decision if I miss something.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/utils_checks.R`:
- Line 173: The AnomalyScores column is being initialized with plain NA which
creates a logical column; update the call to data.table::set in R/utils_checks.R
(the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the
column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9ac4369-d3ff-416f-8dc1-4fdc7f912e4e

📥 Commits

Reviewing files that changed from the base of the PR and between 27d5f7c and 50a30da.

📒 Files selected for processing (3)

R/dataProcess.R
R/utils_checks.R
inst/tinytest/test_memory_optimization_copies.R

🚧 Files skipped from review as they are similar to previous changes (2)

R/dataProcess.R
inst/tinytest/test_memory_optimization_copies.R

coderabbitai · 2026-06-09T08:29:54Z


    if (!"AnomalyScores" %in% colnames(input)){
-      input$AnomalyScores = NA
+      data.table::set(input, j = "AnomalyScores", value = NA)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use NA_real_ for type consistency with downstream numeric AnomalyScores.

Using plain NA creates a logical-typed column. Since ANOMALYSCORES is expected to be numeric (values like 0.03, 0.01 per tests), and .updateColumnsForProcessing at line 318 uses NA_real_, this should match.

After column names are uppercased at line 198, the check at line 317 will find the column already exists and skip the NA_real_ assignment, leaving the column as logical.

Proposed fix

- data.table::set(input, j = "AnomalyScores", value = NA) + data.table::set(input, j = "AnomalyScores", value = NA_real_)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/utils_checks.R` at line 173, The AnomalyScores column is being initialized with plain NA which creates a logical column; update the call to data.table::set in R/utils_checks.R (the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the column type matches downstream numeric expectations (see .updateColumnsForProcessing which uses NA_real_ and the uppercasing step that may leave the column present).

github-actions Bot added the Review effort 4/5 label May 14, 2026

Rudhik1904 added 2 commits May 15, 2026 00:11

Updating the comments.

b64b004

Updating some more comments.

e2bd771

mstaniak requested changes May 16, 2026

View reviewed changes

Comment thread inst/tinytest/test_memory_optimization_copies.R Outdated

Comment thread R/utils_feature_selection.R Outdated

Comment thread R/utils_normalize.R Outdated

Rudhik1904 added 2 commits June 2, 2026 00:39

Responding to the comments.

65df2b5

Responding to comments- 2

0764b34

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Rudhik1904 requested review from mstaniak June 3, 2026 00:05

tonywu1999 commented Jun 5, 2026

View reviewed changes

Vitek-Lab deleted a comment from coderabbitai Bot Jun 8, 2026

mstaniak requested changes Jun 8, 2026

View reviewed changes

Responding to Tony's suggestions

27d5f7c

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Rudhik1904 mentioned this pull request Jun 9, 2026

Also it should be converted to a series of ":=" I think #210

Open

Adressing Matt's comments

50a30da

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

Conversation

tonywu1999 commented May 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Motivation and Context

Changes

Testing

Checklist Before Requesting a Review

PR Type

Description

Diagram Walkthrough

File Walkthrough

Motivation and Context — Short summary of the solution

Detailed changes

Unit tests added / modified

Coding guidelines / potential violations

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagrams

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 inconclusive)

Uh oh!

github-actions Bot commented May 14, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented May 14, 2026

PR Code Suggestions ✨

Uh oh!

mstaniak commented May 15, 2026

Uh oh!

mstaniak left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tonywu1999 Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

tonywu1999 commented May 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading

coderabbitai Bot Jun 2, 2026 •

edited

Loading

tonywu1999 Jun 5, 2026 •

edited

Loading

tonywu1999 Jun 5, 2026 •

edited

Loading

Rudhik1904 Jun 9, 2026 •

edited

Loading

Rudhik1904 commented Jun 9, 2026 •

edited

Loading