Skip to content

Removed deep-copy data.table ops from the dataProcess pipeline#208

Open
tonywu1999 wants to merge 7 commits into
develfrom
MSstats/work/20260514_avoid-deep-copy-ops
Open

Removed deep-copy data.table ops from the dataProcess pipeline#208
tonywu1999 wants to merge 7 commits into
develfrom
MSstats/work/20260514_avoid-deep-copy-ops

Conversation

@tonywu1999

@tonywu1999 tonywu1999 commented May 14, 2026

Copy link
Copy Markdown
Contributor

User description

  • Replaced input[, cols, with = FALSE] deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
  • Replaced row-shuffle input = input[order(...), ] in .prepareForDataProcess with data.table::setorder() (in place).
  • Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes — avoids deep-copying the whole table.
  • Replaced the synthesised tmp string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
  • Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
  • Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
  • Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
  • Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL — a downstream test in test_dataProcess.R relies on LABEL being kept.
  • Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4 — 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Motivation and Context

Please include relevant motivation and context of the problem along with a short summary of the solution.

Changes

Please provide a detailed bullet point list of your changes.

Testing

Please describe any unit tests you added or modified to verify your changes.

Checklist Before Requesting a Review

  • I have read the MSstats contributing guidelines
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules
  • I have run the devtools::document() command after my changes and committed the added files

PR Type

Enhancement, Bug fix, Tests


Description

  • Replace full-table copies with in-place updates

  • Use keyed lookups for joins

  • Split survival outputs before finalization

  • Add memory-regression pipeline tests


Diagram Walkthrough

flowchart LR
  a["Copy-heavy data.table operations"]
  b["In-place set and setorder updates"]
  c["Keyed survival and fraction lookups"]
  d["Lower-memory summarization pipeline"]
  a -- "replaced by" --> b
  b -- "combined with" --> c
  c -- "reduces copies in" --> d
Loading

File Walkthrough

Relevant files
Enhancement
dataProcess.R
Limit censored-value updates to matching rows                       

R/dataProcess.R

  • Replace ifelse() rewrites with targeted := updates
  • Only keep predicted on applicable censored rows
  • Only overwrite newABUNDANCE where imputation applies
+12/-10 
utils_checks.R
Avoid copies during input trimming and sorting                     

R/utils_checks.R

  • Drop unwanted columns via data.table::set(..., NULL)
  • Avoid retained-column deep copies during preparation
  • Sort rows in place with data.table::setorder()
+13/-5   
utils_feature_selection.R
Collapse feature preprocessing into one pass                         

R/utils_feature_selection.R

  • Build the feature-selection working table once
  • Handle optional censored values inline
  • Compute is_obs without intermediate tables
+20/-22 
utils_normalize.R
Use in-place cleanup and keyed fraction merges                     

R/utils_normalize.R

  • Remove normalization temp columns in place
  • Return the modified input after cleanup
  • Replace merge() with keyed newRun assignment
  • Filter fractions by direct (FEATURE, FRACTION) lookup
+23/-13 
Bug fix
utils_output.R
Streamline summary output and imputation joins                     

R/utils_output.R

  • Extract combined predicted_survival before finalization
  • Free nested survival entries before binding summaries
  • Replace merge-based imputation with keyed writes
  • Guard join columns and update flags in place
+51/-27 
Tests
test_memory_optimization_copies.R
Add memory regression tests for copy avoidance                     

inst/tinytest/test_memory_optimization_copies.R

  • Add address-based assertions for in-place behavior
  • Test .normalizeMedian temp-column cleanup
  • Test .finalizeTMP keyed matches and unmatched NAs
  • Verify MSstatsSummarizationOutput list splitting behavior
+469/-0 

Motivation and Context — Short summary of the solution

The dataProcess pipeline used several copy-heavy data.table idioms (column-subset deep copies, merge(all.x=TRUE) joins, order-based reassignments, and full-vector ifelse writes) that caused excessive memory allocations. This PR replaces those with in-place data.table operations (data.table::set(..., value = NULL), data.table::setorder(), keyed which lookups + data.table::set, and targeted [i, j := v] updates) to reduce memory churn while preserving behavior. It also extracts predicted_survival earlier in summarization, nulls nested survival slots, fixes two regressions (predicted_survival join-columns intersection and retention of LABEL), updates documentation for changed parameters, and adds memory-regression tinytests. Full test suite passes.

Detailed changes

  • General

    • Replace copy-producing column/row subset and merge patterns with in-place updates: data.table::set(j=..., value=NULL), data.table::setorder(), keyed which lookups + data.table::set, and targeted [i, j := v] conditional assignments.
    • Extract predicted_survival out of nested summarization early and null survival entries before finalization; final feature/protein outputs converted to data.frame via data.table::setDF().
  • R/dataProcess.R

    • MSstatsSummarizeSingleLinear / MSstatsSummarizeSingleTMP: replace bulk ifelse(...) column rewrites with conditional data.table subsetting + := assignments; set predicted = NA for non-censored labeled-reference rows and write fitted predicted only for censored subset; survival table built after in-place updates.
  • R/utils_checks.R

    • .checkUnProcessedDataValidity: compute canonical column set early, coerce/add defaults using data.table::set, and drop extraneous columns with set(..., value=NULL) instead of later column-subsetting.
    • .prepareForDataProcess: use data.table::set for derived columns (PEPTIDE, TRANSITION) and isotope mapping.
    • .makeFactorColumns: replace copy-based input[order(...), ] with in-place data.table::setorder() using the same multi-column ordering.
  • R/utils_feature_selection.R

    • .selectHighQualityFeatures: collapse subset+transform into a single pass; detect optional censored column, compute log2inty from ABUNDANCE (treating censored/NA as missing), derive is_obs from log2inty non-NA; remove intermediate censored/is_censored renames and avoid intermediate copy.
  • R/utils_normalize.R

    • .normalizeMedian and .normalizeGlobalStandards: remove helper median/normalization columns in-place with set(..., value=NULL) rather than rebuilding input via column exclusion.
    • MSstatsMergeFractions (non-TECHREPLICATE path): derive merged-run lookup from dcast, assign newRun to rows via keyed which lookup + data.table::set, filter out all-zero feature/fraction combos via keyed lookup, update originalRUN, refactor RUN levels, and clear helper newRun in-place; removed merge(all.x=TRUE)+tmp filtering and string-join filtering.
  • R/utils_output.R

    • MSstatsSummarizationOutput: pre-extract predicted_survival (element [[2]]), finalize input using predicted_survival, rebuild summarized from protein-level only (element [[1]]) and null survival components; drop non-output columns in-place and convert outputs to data.frame with setDF().
    • .finalizeInput / .finalizeTMP: accept predicted_survival instead of nested summarized list; TMP imputation applies predicted values via indexed keyed lookup and targeted in-place writes to newABUNDANCE and predicted; update nonmissing_orig via conditional :=.
  • Regressions fixed

    • Ensure .finalizeTMP join_cols intersect predicted_survival columns to avoid losing LABEL.
    • Preserve LABEL selection required downstream.
  • Documentation

    • man/dot-finalizeInput.Rd and man/dot-finalizeTMP.Rd updated to reflect new parameter predicted_survival and reordered usage where applicable.

Unit tests added / modified

  • inst/tinytest/test_memory_optimization_copies.R (new; ~28 assertions, ~441 lines)
    • Asserts in-place mutation via data.table::address() for .normalizeMedian and .finalizeTMP on small and scaled fixtures.
    • Verifies temporary helper columns (ABUNDANCE_RUN, ABUNDANCE_FRACTION, newRun) are removed and ABUNDANCE remains.
    • Validates .finalizeTMP join/update semantics: matched (cen, RUN, FEATURE) update newABUNDANCE from predicted_survival; unmatched keys produce NA; predicted column added.
    • Scale tests (~50K rows) check in-place behavior, required output columns, and NA semantics for excluded join-key combinations.
    • Confirms .finalizeTMP accepts combined predicted_survival table (not nested) and MSstatsSummarizationOutput decomposes nested summarized list correctly; combined predicted_survival is smaller than nested list.
  • inst/tinytest/test_MSstatsMergeFractions.R (new)
    • Tests fraction-merging: single-fraction preservation, ABUNDANCE clamping before merging, non-TECHREPLICATE collapse semantics, dropping unobserved feature/fraction combos, RUN factor handling, and absence of helper newRun.
  • Test results reported: full test suite passes (224/224).

Coding guidelines / potential violations

  • In-place mutation / caller-side mutation risk
    • Several changes perform in-place updates on data.tables passed by reference. This reduces copies but can unexpectedly mutate caller-provided tables. The concern raised in review (using := inside .selectHighQualityFeatures) highlights a real risk: mutating the input may break downstream callers that expect non-mutated inputs or rely on a previous version of the table. Functions that previously returned modified copies may now silently alter caller state.
  • API contract/documentation drift
    • Internal signatures changed (.finalizeInput / .finalizeTMP now accept predicted_survival). Although documented, external callers (or reflective code) depending on prior signature could be affected; ensure internal-only status or coordinate broader updates.
  • Test coverage vs. behavioral guarantees
    • Tests assert address preservation and in-place effects; however, the test suite focuses on current callers and fixtures. Any public-facing functions that relied on non-mutating behavior should be audited and documented to avoid breaking user code.
  • Suggestion (not implemented): where non-mutation is required by contract, explicitly copy (setDT(copy(...)) or copy() at function entry) or document mutability in function docs to make side-effects explicit.

* Replaced `input[, cols, with = FALSE]` deep-copy in
  MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with
  drop-cols loops via data.table::set(j = ..., value = NULL).
* Replaced row-shuffle `input = input[order(...), ]` in
  .prepareForDataProcess with data.table::setorder() (in place).
* Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and
  .finalizeTMP with keyed-which lookups + data.table::set() writes —
  avoids deep-copying the whole table.
* Replaced the synthesised `tmp` string-join filter in
  MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed
  lookup; drops two large character vectors and a paste() call.
* Replaced ifelse() full-vector writes for predicted/newABUNDANCE
  and nonmissing_orig with targeted [i, j := v] in-place writes.
* Collapsed the two-step subset+transform in .selectHighQualityFeatures
  into a single pass to eliminate one intermediate data.table copy.
* Reworked MSstatsSummarizationOutput to extract predicted_survival
  upfront and null per-protein second slots so the nested-list
  duplication is freed before .finalizeTMP runs; switched the
  final return to data.table::setDF() in place of as.data.frame().
* Fixed two regressions in the original commit: (1) .finalizeTMP's
  join_cols must intersect with predicted_survival's columns so the
  keyed lookup doesn't error on missing LABEL; (2) reverted the
  survival-column-selection tightening that dropped LABEL — a
  downstream test in test_dataProcess.R relies on LABEL being kept.
* Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4
  — 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Co-Authored-By: Claude <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Refactor to perform in-place data.table updates, extract predicted_survival from summarization outputs and pass it into finalizers, replace merges/ifelse with indexed := assignments, and add tinytests validating in-place and merge behaviors.

Changes

DataProcess Pipeline Memory Optimization and Output Refactoring

Layer / File(s) Summary
Data validation and preprocessing cleanup
R/utils_checks.R
Compute allowed columns early, remove unexpected columns in-place, insert/coerce required columns with data.table::set, and use data.table::setorder() before factorization.
Feature selection and censored handling
R/utils_feature_selection.R
Simplify preprocessing by treating optional censored as missingness and deriving is_obs directly from ABUNDANCE/log2 intensity.
Normalization cleanup and fraction-merge rewrite
R/utils_normalize.R
Remove temporary median/normalization helpers with data.table::set(..., value=NULL) and rewrite MSstatsMergeFractions non-TECHREPLICATE branch to use dcast-derived lookup, join-index assignment, and in-place cleanup instead of merge/tmp columns.
Output assembly and predicted_survival plumbing
R/utils_output.R
Extract predicted_survival from summarized, pass it into .finalizeInput/.finalizeTMP, apply imputations via indexed data.table lookups and targeted := assignments, drop non-output columns in-place, and convert outputs to data.frames with setDF.
Censored-value imputation post-processing
R/dataProcess.R
Replace ifelse(...) column reassignments with conditional data.table subset := updates in MSstatsSummarizeSingleLinear and MSstatsSummarizeSingleTMP, with explicit handling for is_labeled_reference and conditional nonmissing_orig updates.
Memory-optimization and merge tests
inst/tinytest/*
Add tests asserting in-place data.table behavior (addresses preserved), temporary columns removed, correct imputation for matched/unmatched keys, a ~50K-row stress test for .finalizeTMP, contract checks for predicted_survival shape/size, and MSstatsMergeFractions behavior tests.
Documentation updates
man/dot-finalizeInput.Rd, man/dot-finalizeTMP.Rd
Help topics updated to reflect new predicted_survival parameter and reordered arguments.

Sequence Diagrams

No sequence diagrams generated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Vitek-Lab/MSstats#192: Related edits around labeled/reference handling passed into summarization (ref/is_labeled_reference) that touch similar code paths.
  • Vitek-Lab/MSstats#207: Overlapping changes to per-protein summarizers and survival/fit object handling in dataProcess.
  • Vitek-Lab/MSstats#193: Related modifications to censored post-imputation handling and label-aware gating in R/dataProcess.R.

Poem

🐇 I nudged the columns, light and sly,
In-place I whisper, no copies fly.
Survival split, then stitched again,
Tests check addresses, tidy and plain.
A rabbit's hop — efficient, spry.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The description includes detailed bullet points explaining changes and testing, but the 'Motivation and Context' section under the template remains unfilled with only placeholder content, and the 'Changes' section bullets are empty. Fill in the 'Motivation and Context' section with the problem summary and solution approach, and complete the 'Changes' bullet points with the specific modifications made.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main objective of the changeset: removing deep-copy operations from data.table operations in the dataProcess pipeline to improve memory efficiency.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch MSstats/work/20260514_avoid-deep-copy-ops

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Input mutation

MSstatsPrepareForDataProcess now removes columns from input in place and then adds derived columns on the same object. If the caller passes a data.table they intend to reuse, this function will now silently delete all non-selected columns from that original object, which did not happen before when the code first created a subset copy.

drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

input$PEPTIDE = paste(input$PEPTIDESEQUENCE,
                      input$PRECURSORCHARGE, sep = "_")
input$TRANSITION = paste(input$FRAGMENTION, 
Output mutation

MSstatsSummarizationOutput now mutates its input argument by reference by dropping columns and converting it with setDF(). Because input is a data.table, the caller's object will be stripped down to output columns and lose its data.table class after this call, which can break any downstream code that reuses the same table.

drop_cols = setdiff(colnames(input), output_cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

if (is.element("remove", colnames(processed))) {
    processed = processed[(remove),
                          intersect(output_cols,
                                    colnames(processed)), with = FALSE]
    input = rbind(input, processed, fill = TRUE)
}
data.table::setDF(input)
data.table::setDF(rqall)
list(FeatureLevelData = input,

@github-actions

Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Avoid mutating caller objects

data.table::setDF() mutates by reference, so this function now converts the caller's
input object into a plain data.frame. That is a behavioral regression from
as.data.frame(...) and can break later code that still expects input to be a
data.table.

R/utils_output.R [101-105]

-data.table::setDF(input)
-data.table::setDF(rqall)
-list(FeatureLevelData = input,
-     ProteinLevelData = rqall,
+feature_output = data.table::copy(input)
+protein_output = data.table::copy(rqall)
+data.table::setDF(feature_output)
+data.table::setDF(protein_output)
+list(FeatureLevelData = feature_output,
+     ProteinLevelData = protein_output,
      SummaryMethod = method)
Suggestion importance[1-10]: 7

__

Why: This is correct and important for API behavior: data.table::setDF() mutates input and rqall by reference, unlike the previous as.data.frame(...) conversion. Copying before setDF() preserves the optimization while avoiding surprising side effects on caller-visible data.table objects.

Medium
Guard nested summary extraction

Validate summarized before indexing x[[2]]. The new code dereferences every element
immediately, so any try-error or short result now crashes here and bypasses the
function's existing failure handling.

R/utils_output.R [41-44]

-predicted_survival = data.table::rbindlist(lapply(summarized, function(x) x[[2]]),
-                                            fill = TRUE)
+invalid_summary = vapply(
+    summarized,
+    function(x) inherits(x, "try-error") || length(x) < 2L,
+    logical(1)
+)
+if (any(invalid_summary)) {
+    stop("`summarized` contains failed or incomplete protein summaries.")
+}
+predicted_survival = data.table::rbindlist(lapply(summarized, `[[`, 2L),
+                                           fill = TRUE)
 for (i in seq_along(summarized)) summarized[[i]][[2]] = NULL
 input = .finalizeInput(input, predicted_survival, method, impute, censored_symbol)
Suggestion importance[1-10]: 6

__

Why: This is a valid robustness concern: MSstatsSummarizationOutput() now dereferences x[[2]] before the later inherits(summarized, "try-error") check, so malformed entries in summarized can fail earlier with an uncontrolled error. The fix is relevant, but it mainly improves error handling rather than changing core results.

Low

@mstaniak

Copy link
Copy Markdown
Contributor

Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?

@mstaniak mstaniak left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
thanks again for this update. I have a few minor comments

Comment thread inst/tinytest/test_memory_optimization_copies.R Outdated
Comment thread R/utils_feature_selection.R Outdated
Comment thread R/utils_normalize.R Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
inst/tinytest/test_memory_optimization_copies.R (1)

328-369: ⚡ Quick win

Add a mixed-LABEL fixture to this contract test.

These assertions only exercise LABEL = "L", so a regression that drops LABEL from the survival projection or join keys would still pass here. A small L/H fixture with duplicated (RUN, FEATURE) values would cover the regression this stack is guarding against.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inst/tinytest/test_memory_optimization_copies.R` around lines 328 - 369, The
test only uses LABEL = "L", so add mixed LABEL values and duplicated (RUN,
FEATURE) combos to both finalize_input_4 and pred_surv_4 to exercise join keys:
modify finalize_input_4$LABEL to contain a small mixture (e.g. "L" and "H" as a
factor) with duplicated RUN/FEATURE pairs across labels, and add a LABEL column
to pred_surv_4 with matching L/H entries (and duplicate RUN/FEATURE rows) so
MSstats:::.finalizeTMP must preserve/join on LABEL; keep result_4 assertions but
ensure the fixture includes those mixed-label cases to catch regressions that
drop LABEL from survival projection or join keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 206-212: The test currently counts all non-NA newABUNDANCE values
(matched_count) which can include rows that were already populated; instead
capture the rows that started with newABUNDANCE = NA before calling
.finalizeTMP() (e.g. store original_na_idx <-
is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.

In `@R/utils_output.R`:
- Around line 41-49: Check whether summarized contains a "try-error" result
before accessing x[[1]]/x[[2]]: if any element of summarized inherits from
"try-error" (the fallback path intended for failed
MSstatsSummarizeWithSingleCore()), do not rbind or unpack
predicted_survival/protein_summaries; instead invoke the existing fallback
behavior (the same path currently guarded at the later check) and avoid calling
.finalizeInput on invalid data. Update the block that builds predicted_survival
and protein_summaries to first detect try-error in summarized and branch to the
fallback handling when present, referencing the summarized variable and the
.finalizeInput call to ensure invalid summary results are not unpacked.
- Around line 101-102: The calls to data.table::setDF(input) and
data.table::setDF(rqall) mutate caller-owned objects in place; update
MSstatsSummarizationOutput to avoid by-reference mutation by operating on copies
instead (e.g., create local copies like input_copy <- input and rqall_copy <-
rqall or coerce with as.data.frame() on copies) and call data.table::setDF() (or
as.data.frame) on those copies so the original input and rqall keep their
data.table class; ensure all subsequent uses in the function reference the
copied variables (input_copy, rqall_copy) rather than the originals.

---

Nitpick comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 328-369: The test only uses LABEL = "L", so add mixed LABEL values
and duplicated (RUN, FEATURE) combos to both finalize_input_4 and pred_surv_4 to
exercise join keys: modify finalize_input_4$LABEL to contain a small mixture
(e.g. "L" and "H" as a factor) with duplicated RUN/FEATURE pairs across labels,
and add a LABEL column to pred_surv_4 with matching L/H entries (and duplicate
RUN/FEATURE rows) so MSstats:::.finalizeTMP must preserve/join on LABEL; keep
result_4 assertions but ensure the fixture includes those mixed-label cases to
catch regressions that drop LABEL from survival projection or join keys.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2610ce77-e553-465d-8347-26a794f1b249

📥 Commits

Reviewing files that changed from the base of the PR and between 3fa0bd1 and 0764b34.

📒 Files selected for processing (6)
  • R/dataProcess.R
  • R/utils_checks.R
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R
  • inst/tinytest/test_memory_optimization_copies.R

Comment on lines +206 to +212
# (b) Matched rows should have newABUNDANCE from predicted_survival.
# Any row whose (cen, RUN, FEATURE) key exists in predicted_survival
# should get a non-NA newABUNDANCE value.
matched_count = sum(!is.na(result$newABUNDANCE))
expect_true(matched_count > 0,
info = paste("Matched rows should have non-NA newABUNDANCE.",
"Found", matched_count, "non-NA values"))

@coderabbitai coderabbitai Bot Jun 2, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert on the imputed rows directly.

matched_count includes rows that were already non-NA before .finalizeTMP(), so this still passes if the join never writes the censored rows. Check exact expected values on the rows that start with newABUNDANCE = NA instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@inst/tinytest/test_memory_optimization_copies.R` around lines 206 - 212, The
test currently counts all non-NA newABUNDANCE values (matched_count) which can
include rows that were already populated; instead capture the rows that started
with newABUNDANCE = NA before calling .finalizeTMP() (e.g. store original_na_idx
<- is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a second thought - let's check the total count, not the minimum here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as in: comparing to 0 checks if anything is there, lets compare to the total expected number of non-NA rows, aka "==" check)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mstaniak — makes sense. Based on the fixture, 8 rows start non-NA and predicted_survival covers 3 of the 4 originally-NA rows (row 11's (RUN=2, FEATURE=feat2) for P2 is intentionally excluded), so the expected total is 11:

# (b) After imputation, non-NA count should equal the number of rows
#     covered by predicted_survival plus the rows that were already non-NA.
#     Row 11 (P2 / feat2 / RUN=2) is intentionally excluded, so it stays NA.
expect_equal(matched_count, 11L,
             info = paste("Expected 11 non-NA newABUNDANCE values after imputation.",
                          "Found", matched_count))

This is a strict equality check and will catch both "nothing got imputed" and "too many rows were filled" regressions.


🧠 Learnings used
Learnt from: tonywu1999
Repo: Vitek-Lab/MSstats PR: 192
File: R/utils_censored.R:34-35
Timestamp: 2026-04-13T14:51:35.558Z
Learning: In MSstats (R package, `R/utils_censored.R`), the `use_for_analysis` fallback in `MSstatsHandleMissing` intentionally uses `rep(TRUE, nrow(input))` when the `ref` column is absent. SRM experiments with heavy labels are required to use `EQUALIZEMEDIANS` normalization, which is the only path that adds the `ref` column. Other normalization methods (QUANTILE, GLOBALSTANDARDS, FALSE/NONE) are not appropriate for labeled experiments with heavy references, so when `ref` is missing the data is label-free and applying censoring to all rows is correct. A fallback to `input$LABEL == "L"` is unnecessary.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mstaniak, got it — the total expected non-NA count after imputation is 11 (8 rows that were already non-NA + 3 censored rows that have a match in predicted_survival; only row 11 — P2/feat2/RUN=2 — stays NA because it was deliberately excluded).

Replace the assertion:

# (b) All rows that can be imputed should have non-NA newABUNDANCE.
# 8 rows were already non-NA; of the 4 censored rows (2, 6, 8, 11),
# rows 2/6/8 match in predicted_survival and row 11 (P2/feat2/RUN=2) was
# deliberately excluded → 11 total non-NA values expected.
expected_non_na = 11L
matched_count = sum(!is.na(result$newABUNDANCE))
expect_equal(matched_count, expected_non_na,
             info = paste("Expected", expected_non_na,
                          "non-NA newABUNDANCE values after imputation.",
                          "Got:", matched_count))

This makes the check exact: it fails if even one eligible censored row was not filled in, rather than just verifying "at least something happened."


🧠 Learnings used
Learnt from: tonywu1999
Repo: Vitek-Lab/MSstats PR: 192
File: R/utils_censored.R:34-35
Timestamp: 2026-04-13T14:51:35.558Z
Learning: In MSstats (R package, `R/utils_censored.R`), the `use_for_analysis` fallback in `MSstatsHandleMissing` intentionally uses `rep(TRUE, nrow(input))` when the `ref` column is absent. SRM experiments with heavy labels are required to use `EQUALIZEMEDIANS` normalization, which is the only path that adds the `ref` column. Other normalization methods (QUANTILE, GLOBALSTANDARDS, FALSE/NONE) are not appropriate for labeled experiments with heavy references, so when `ref` is missing the data is label-free and applying censoring to all rows is correct. A fallback to `input$LABEL == "L"` is unnecessary.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - switched it to an exact == count check. It now computes the number of (cen, RUN, FEATURE) keys actually present in predicted_survival and asserts that newABUNDANCE has exactly that many non-NA values:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!

Comment thread R/utils_output.R Outdated
Comment thread R/utils_output.R Outdated
@Rudhik1904 Rudhik1904 requested review from mstaniak June 3, 2026 00:05
Comment thread R/utils_feature_selection.R Outdated
input[, is_obs := !(is.na(log2inty) | is_censored)]
input[, is_censored := NULL]

is_obs = FALSE)]

@tonywu1999 tonywu1999 Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of setting is_obs = FALSE if is_obs is set again in line 90?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment thread R/utils_feature_selection.R Outdated
log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored),
log2inty = ifelse(!(is.na(ABUNDANCE) |
if (has_censored) censored else FALSE),
ABUNDANCE, NA),

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement is a little confusing. Can it be simplified? The if statement in the middle of this makes it confusing

ifelse(!is.na(ABUNDANCE) & !(has_censored & censored), ABUNDANCE, NA)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOne

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it should be converted to a series of ":=" I think

Comment thread R/utils_normalize.R Outdated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed here?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "merged" column existed only so it could be glued onto the end of the run name.
Now "merged" is visibly just a suffix in the name

Comment thread R/utils_normalize.R
input$RUN = factor(input$RUN, levels=unique(input$RUN), labels=seq(1, length(unique(input$RUN))))
input = input[, !(colnames(input) %in% c('tmp','newRun')), with = FALSE]
}
} else {

@tonywu1999 tonywu1999 Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add unit tests for the MSstatsMergeFractions function? It seems to be a really confusing part and it'd be helpful to have test cases that verify its original behavior remains unchanged.

One more note: this section of code took me a really long time to figure out what was going on. If possible, could you look to simplify the code so that it's obvious what this function's intent is? I'd timeblock at most 30 minutes for simplifying the code and making the original intent obvious, I don't think this code block is used often by users.

@Rudhik1904 Rudhik1904 Jun 9, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The merged := "merged" column only existed so the next line could append a "merged" suffix to the run name via positional indexing: .SDcols = c(1:3, ncol(match_runs)). The non-obvious part is what those positions actually select. With, say, 4 fractions, match_runs (after dcast) looks like:

pos:        1                 2              3     4     5     6       7
     GROUP_ORIGINAL    SUBJECT_ORIGINAL    "1"   "2"   "3"   "4"    merged

So c(1:3, ncol) = positions 1, 2, 3, 7 = GROUP_ORIGINAL, SUBJECT_ORIGINAL, fraction 1 only, and merged. Fractions 2–4 are never used — the merged run is named after just the first fraction's run, even though the code looks like it might use all of them.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I focused on the non-TECHREPLICATE branch, which is the part this PR rewrote

Comment thread R/utils_output.R Outdated
summarized = data.table::rbindlist(protein_summaries, fill = TRUE)
rm(protein_summaries)

if (inherits(summarized, "try-error")) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does summarized inherit try-error here if there's no try-catch block for line 48?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't see a try-catch clause in the summarization code. Is there even a useful place for it there?
On a related note: we use withCallingHandlers for AFT call there, isn't it very slow?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed it with: if (is.null(summarized))

Comment thread R/utils_output.R Outdated
@mstaniak

mstaniak commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Hi,
thanks for the updates, I'll review these changes on Monday

@Vitek-Lab Vitek-Lab deleted a comment from coderabbitai Bot Jun 8, 2026
Comment thread R/dataProcess.R
Comment on lines 416 to 424
if (is_labeled_reference) {
single_protein[, predicted := ifelse(censored & is_labeled_ref == FALSE, predicted, NA)]
single_protein[, newABUNDANCE := ifelse(censored & is_labeled_ref == FALSE, predicted, newABUNDANCE)]
single_protein[!(censored & is_labeled_ref == FALSE), predicted := NA]
single_protein[(censored) & is_labeled_ref == FALSE,
newABUNDANCE := predicted]
} else {
single_protein[, predicted := ifelse(censored, predicted, NA)]
single_protein[, newABUNDANCE := ifelse(censored, predicted, newABUNDANCE)]
single_protein[!(censored), predicted := NA]
single_protein[(censored), newABUNDANCE := predicted]
}

survival = single_protein[, intersect(c(cols, "LABEL", "predicted"), colnames(single_protein)), with = FALSE]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please

  • remove the "==FALSE" comparisons
  • check if the line 424 creates a copy that matters

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — replaced all six is_labeled_ref == FALSE comparisons with !is_labeled_ref (in both MSstatsSummarizeSingleLinear and MSstatsSummarizeSingleTMP).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked with address() — single_protein[, keep, with = FALSE] does materialize a copy of the selected columns (they get new addresses, not shared). But it's a per-protein slice (a handful of columns × one protein's rows), not a whole-dataset copy, so the size is negligible relative to the copies this PR targets. It's also a necessary copy: survival is returned separately and, in the no-impute branch, gets survival[, predicted := NA] — sharing storage with single_protein would corrupt it. So I'd leave it as-is, but happy to revisit

Comment thread R/dataProcess.R
newABUNDANCE := predicted]
}
survival = single_protein[, intersect(c(cols, "LABEL", "predicted"), colnames(single_protein)), with = FALSE]
} else {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save as above

  • let's remove the "==FALSE" comparisons
  • let's see if line 580 creates a copy (of a considerable size, if any)

Comment thread R/utils_checks.R Outdated
drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

input$PEPTIDE = paste(input$PEPTIDESEQUENCE,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"$" assignments should go. Line 215 is the way to go

Comment thread R/utils_feature_selection.R Outdated
log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored),
log2inty = ifelse(!(is.na(ABUNDANCE) |
if (has_censored) censored else FALSE),
ABUNDANCE, NA),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it should be converted to a series of ":=" I think

Comment thread R/utils_output.R Outdated
summarized = data.table::rbindlist(protein_summaries, fill = TRUE)
rm(protein_summaries)

if (inherits(summarized, "try-error")) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't see a try-catch clause in the summarization code. Is there even a useful place for it there?
On a related note: we use withCallingHandlers for AFT call there, isn't it very slow?

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@man/dot-finalizeInput.Rd`:
- Around line 7-18: The Rd documentation is out of sync: update the \arguments
block for .finalizeInput to replace the stale item {summarized} with an entry
for {predicted_survival} that describes the new data.table contract;
specifically, ensure the arguments list includes input, predicted_survival,
method, impute, and censored_symbol and that the predicted_survival description
matches the function's expected data.table structure (columns/types) used by
.finalizeInput so \usage and \arguments are consistent with the function
signature.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eb8c80d9-00e4-4549-b689-e83302de9aac

📥 Commits

Reviewing files that changed from the base of the PR and between 0764b34 and 27d5f7c.

📒 Files selected for processing (6)
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R
  • inst/tinytest/test_MSstatsMergeFractions.R
  • man/dot-finalizeInput.Rd
  • man/dot-finalizeTMP.Rd
✅ Files skipped from review due to trivial changes (1)
  • man/dot-finalizeTMP.Rd
🚧 Files skipped from review as they are similar to previous changes (3)
  • R/utils_feature_selection.R
  • R/utils_normalize.R
  • R/utils_output.R

Comment thread man/dot-finalizeInput.Rd
Comment on lines +7 to +18
.finalizeInput(input, predicted_survival, method, impute, censored_symbol)
}
\arguments{
\item{input}{feature-level data}

\item{summarized}{protein-level data (list)}

\item{method}{summary method}

\item{impute}{if TRUE, censored missing values were imputed}

\item{censored_symbol}{censored missing value indicator}

\item{summarized}{protein-level data (list)}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Documented arguments are out of sync with the updated function signature.

Line 7 uses predicted_survival, but Line 18 still documents summarized and there is no predicted_survival argument entry. Please replace the stale summarized item with predicted_survival (and description matching the new data.table contract) so \usage and \arguments are consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@man/dot-finalizeInput.Rd` around lines 7 - 18, The Rd documentation is out of
sync: update the \arguments block for .finalizeInput to replace the stale item
{summarized} with an entry for {predicted_survival} that describes the new
data.table contract; specifically, ensure the arguments list includes input,
predicted_survival, method, impute, and censored_symbol and that the
predicted_survival description matches the function's expected data.table
structure (columns/types) used by .finalizeInput so \usage and \arguments are
consistent with the function signature.

@Rudhik1904

Rudhik1904 commented Jun 9, 2026

Copy link
Copy Markdown

@mstaniak

For issue: #208 (comment)

I looked into this but think the current input[, list(...)] is actually the right call here, because of the following:

  1. In-place := on input would corrupt the caller. .selectHighQualityFeatures is called from MSstatsSelectFeatures, which reuses the same input immediately afterwards:
features_quality = .selectHighQualityFeatures(input, min_feature_count)
input = merge(input, features_quality,
              by.x = c("LABEL", "PROTEIN", "FEATURE", "originalRUN"),
              by.y = c("label", "protein", "feature", "run"))

Since data.table :=/setnames modify by reference across function boundaries, renaming the columns to lowercase (or adding log2inty/is_obs) inside the function would mutate the caller's table and break that merge (its by.x keys would no longer exist). The input[, list(...)] form avoids this by returning a new table and only rebinding the local name.

  1. It's already a minimal projection, not a full-table copy. list(...) materializes only the 7 columns the feature-selection step needs — it's not the kind of whole-table duplication this PR is targeting.

happy to reevaluate my decision if I miss something.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/utils_checks.R`:
- Line 173: The AnomalyScores column is being initialized with plain NA which
creates a logical column; update the call to data.table::set in R/utils_checks.R
(the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the
column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a9ac4369-d3ff-416f-8dc1-4fdc7f912e4e

📥 Commits

Reviewing files that changed from the base of the PR and between 27d5f7c and 50a30da.

📒 Files selected for processing (3)
  • R/dataProcess.R
  • R/utils_checks.R
  • inst/tinytest/test_memory_optimization_copies.R
🚧 Files skipped from review as they are similar to previous changes (2)
  • R/dataProcess.R
  • inst/tinytest/test_memory_optimization_copies.R

Comment thread R/utils_checks.R

if (!"AnomalyScores" %in% colnames(input)){
input$AnomalyScores = NA
data.table::set(input, j = "AnomalyScores", value = NA)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use NA_real_ for type consistency with downstream numeric AnomalyScores.

Using plain NA creates a logical-typed column. Since ANOMALYSCORES is expected to be numeric (values like 0.03, 0.01 per tests), and .updateColumnsForProcessing at line 318 uses NA_real_, this should match.

After column names are uppercased at line 198, the check at line 317 will find the column already exists and skip the NA_real_ assignment, leaving the column as logical.

Proposed fix
-      data.table::set(input, j = "AnomalyScores", value = NA)
+      data.table::set(input, j = "AnomalyScores", value = NA_real_)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@R/utils_checks.R` at line 173, The AnomalyScores column is being initialized
with plain NA which creates a logical column; update the call to data.table::set
in R/utils_checks.R (the line that sets j = "AnomalyScores") to use NA_real_
instead of NA so the column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants