Removed deep-copy data.table ops from the dataProcess pipeline #208
tonywu1999 wants to merge 7 commits into
* Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
* Replaced row-shuffle `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place).
* Replaced merge(all.x = TRUE) joins in MSstatsMergeFractions and .finalizeTMP with keyed-which lookups + data.table::set() writes; this avoids deep-copying the whole table.
* Replaced the synthesised `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.
* Replaced ifelse() full-vector writes for predicted/newABUNDANCE and nonmissing_orig with targeted [i, j := v] in-place writes.
* Collapsed the two-step subset+transform in .selectHighQualityFeatures into a single pass to eliminate one intermediate data.table copy.
* Reworked MSstatsSummarizationOutput to extract predicted_survival upfront and null per-protein second slots so the nested-list duplication is freed before .finalizeTMP runs; switched the final return to data.table::setDF() in place of as.data.frame().
* Fixed two regressions in the original commit: (1) .finalizeTMP's join_cols must intersect with predicted_survival's columns so the keyed lookup doesn't error on missing LABEL; (2) reverted the survival-column-selection tightening that dropped LABEL, because a downstream test in test_dataProcess.R relies on LABEL being kept.
* Tests: inst/tinytest/test_memory_optimization_copies.R Issues 2/3/4: 28 assertions, all green. Full suite 224/224 OK.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md

Co-Authored-By: Claude <noreply@anthropic.com>
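The first two bullets can be sketched on a toy table. This is only an illustration: the table, its column names, and the `keep` vector are hypothetical and not taken from the MSstats code. It drops columns with `data.table::set(..., value = NULL)`, sorts with `data.table::setorder()`, and uses `data.table::address()` to check that the table object itself was not reallocated.

```r
library(data.table)

# Hypothetical toy table; names and values are illustrative only.
dt <- data.table(PROTEIN = c("P2", "P1", "P1"),
                 RUN = c(1L, 2L, 1L),
                 INTENSITY = c(10, 20, 30),
                 scratch = c("a", "b", "c"))
addr_before <- address(dt)

# Drop every column outside `keep` by reference,
# instead of dt <- dt[, keep, with = FALSE], which builds a new table.
keep <- c("PROTEIN", "RUN", "INTENSITY")
for (col in setdiff(colnames(dt), keep)) {
  set(dt, j = col, value = NULL)
}

# Reorder rows by reference, instead of dt <- dt[order(PROTEIN, RUN), ].
setorder(dt, PROTEIN, RUN)

# Same object before and after both operations.
same_object <- identical(address(dt), addr_before)
```

Because both operations work by reference, any other name bound to the same table sees the change too, which is the trade-off discussed later in this thread.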
📝 Walkthrough
Refactor to perform in-place data.table updates, extract predicted_survival from summarization outputs and pass it into finalizers, replace merges/ifelse with indexed := assignments, and add tinytests validating in-place and merge behaviors.
Changes: DataProcess Pipeline Memory Optimization and Output Refactoring
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (inconclusive)
Great update, thanks. Did you get a chance to evaluate the memory gain with lineprof or lobstr?
mstaniak left a comment:
Hi,
thanks again for this update. I have a few minor comments
Actionable comments posted: 3
🧹 Nitpick comments (1)
inst/tinytest/test_memory_optimization_copies.R (1)
328-369: ⚡ Quick win: Add a mixed-LABEL fixture to this contract test.
These assertions only exercise LABEL = "L", so a regression that drops LABEL from the survival projection or join keys would still pass here. A small L/H fixture with duplicated (RUN, FEATURE) values would cover the regression this stack is guarding against.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inst/tinytest/test_memory_optimization_copies.R` around lines 328 - 369, The test only uses LABEL = "L", so add mixed LABEL values and duplicated (RUN, FEATURE) combos to both finalize_input_4 and pred_surv_4 to exercise join keys: modify finalize_input_4$LABEL to contain a small mixture (e.g. "L" and "H" as a factor) with duplicated RUN/FEATURE pairs across labels, and add a LABEL column to pred_surv_4 with matching L/H entries (and duplicate RUN/FEATURE rows) so MSstats:::.finalizeTMP must preserve/join on LABEL; keep result_4 assertions but ensure the fixture includes those mixed-label cases to catch regressions that drop LABEL from survival projection or join keys.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@inst/tinytest/test_memory_optimization_copies.R`:
- Around line 206-212: The test currently counts all non-NA newABUNDANCE values
(matched_count) which can include rows that were already populated; instead
capture the rows that started with newABUNDANCE = NA before calling
.finalizeTMP() (e.g. store original_na_idx <-
is.na(original_result$newABUNDANCE)), then after .finalizeTMP() assert that
result$newABUNDANCE[original_na_idx] are non-NA and equal to the expected
imputed values from predicted_survival (use the (cen, RUN, FEATURE) key to
compare), replacing the generic expect_true(matched_count > 0) with direct
checks on those indices.
In `@R/utils_output.R`:
- Around line 41-49: Check whether summarized contains a "try-error" result
before accessing x[[1]]/x[[2]]: if any element of summarized inherits from
"try-error" (the fallback path intended for failed
MSstatsSummarizeWithSingleCore()), do not rbind or unpack
predicted_survival/protein_summaries; instead invoke the existing fallback
behavior (the same path currently guarded at the later check) and avoid calling
.finalizeInput on invalid data. Update the block that builds predicted_survival
and protein_summaries to first detect try-error in summarized and branch to the
fallback handling when present, referencing the summarized variable and the
.finalizeInput call to ensure invalid summary results are not unpacked.
- Around line 101-102: The calls to data.table::setDF(input) and
data.table::setDF(rqall) mutate caller-owned objects in place; update
MSstatsSummarizationOutput to avoid by-reference mutation by operating on copies
instead (e.g., create local copies like input_copy <- input and rqall_copy <-
rqall or coerce with as.data.frame() on copies) and call data.table::setDF() (or
as.data.frame) on those copies so the original input and rqall keep their
data.table class; ensure all subsequent uses in the function reference the
copied variables (input_copy, rqall_copy) rather than the originals.
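The by-reference concern in the setDF comment can be illustrated with a small sketch. The two helper functions are hypothetical and exist only for this example. Note that a plain assignment such as `input_copy <- input` binds a second name to the same data.table and does not copy it; an actual copy needs `data.table::copy()` or a copying conversion such as `as.data.frame()`.

```r
library(data.table)

# Hypothetical helper: converts its argument in place before returning it.
to_df_in_place <- function(x) {
  setDF(x)  # changes the class of the caller's object as well
  x
}

# Hypothetical helper: leaves the caller's object untouched,
# at the cost of building a separate data.frame.
to_df_copy <- function(x) {
  as.data.frame(x)
}

dt1 <- data.table(a = 1:3)
out1 <- to_df_in_place(dt1)
caller1_still_dt <- is.data.table(dt1)  # the caller's object was converted

dt2 <- data.table(a = 1:3)
out2 <- to_df_copy(dt2)
caller2_still_dt <- is.data.table(dt2)  # the caller's object keeps its class
```

Whether the in-place conversion is acceptable depends on whether callers of MSstatsSummarizationOutput reuse the tables afterwards, which the source does not state.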
📒 Files selected for processing (6)
R/dataProcess.R, R/utils_checks.R, R/utils_feature_selection.R, R/utils_normalize.R, R/utils_output.R, inst/tinytest/test_memory_optimization_copies.R
# (b) Matched rows should have newABUNDANCE from predicted_survival.
# Any row whose (cen, RUN, FEATURE) key exists in predicted_survival
# should get a non-NA newABUNDANCE value.
matched_count = sum(!is.na(result$newABUNDANCE))
expect_true(matched_count > 0,
            info = paste("Matched rows should have non-NA newABUNDANCE.",
                         "Found", matched_count, "non-NA values"))
Assert on the imputed rows directly.
matched_count includes rows that were already non-NA before .finalizeTMP(), so this still passes if the join never writes the censored rows. Check exact expected values on the rows that start with newABUNDANCE = NA instead.
On second thought, let's check the total count, not the minimum here.
(As in: comparing to 0 only checks that anything is there; let's compare to the total expected number of non-NA rows, i.e. an "==" check.)
@mstaniak — makes sense. Based on the fixture, 8 rows start non-NA and predicted_survival covers 3 of the 4 originally-NA rows (row 11's (RUN=2, FEATURE=feat2) for P2 is intentionally excluded), so the expected total is 11:
# (b) After imputation, non-NA count should equal the number of rows
# covered by predicted_survival plus the rows that were already non-NA.
# Row 11 (P2 / feat2 / RUN=2) is intentionally excluded, so it stays NA.
expect_equal(matched_count, 11L,
             info = paste("Expected 11 non-NA newABUNDANCE values after imputation.",
                          "Found", matched_count))
This is a strict equality check and will catch both "nothing got imputed" and "too many rows were filled" regressions.
🧠 Learnings used
Learnt from: tonywu1999
Repo: Vitek-Lab/MSstats PR: 192
File: R/utils_censored.R:34-35
Timestamp: 2026-04-13T14:51:35.558Z
Learning: In MSstats (R package, `R/utils_censored.R`), the `use_for_analysis` fallback in `MSstatsHandleMissing` intentionally uses `rep(TRUE, nrow(input))` when the `ref` column is absent. SRM experiments with heavy labels are required to use `EQUALIZEMEDIANS` normalization, which is the only path that adds the `ref` column. Other normalization methods (QUANTILE, GLOBALSTANDARDS, FALSE/NONE) are not appropriate for labeled experiments with heavy references, so when `ref` is missing the data is label-free and applying censoring to all rows is correct. A fallback to `input$LABEL == "L"` is unnecessary.
@mstaniak, got it — the total expected non-NA count after imputation is 11 (8 rows that were already non-NA + 3 censored rows that have a match in predicted_survival; only row 11 — P2/feat2/RUN=2 — stays NA because it was deliberately excluded).
Replace the assertion:
# (b) All rows that can be imputed should have non-NA newABUNDANCE.
# 8 rows were already non-NA; of the 4 censored rows (2, 6, 8, 11),
# rows 2/6/8 match in predicted_survival and row 11 (P2/feat2/RUN=2) was
# deliberately excluded → 11 total non-NA values expected.
expected_non_na = 11L
matched_count = sum(!is.na(result$newABUNDANCE))
expect_equal(matched_count, expected_non_na,
             info = paste("Expected", expected_non_na,
                          "non-NA newABUNDANCE values after imputation.",
                          "Got:", matched_count))
This makes the check exact: it fails if even one eligible censored row was not filled in, rather than just verifying "at least something happened."
Good point - switched it to an exact == count check. It now computes the number of (cen, RUN, FEATURE) keys actually present in predicted_survival and asserts that newABUNDANCE has exactly that many non-NA values:
input[, is_obs := !(is.na(log2inty) | is_censored)]
input[, is_censored := NULL]

is_obs = FALSE)]
What's the point of setting is_obs = FALSE if is_obs is set again in line 90?
- log2inty = ifelse(!(is.na(ABUNDANCE) | is_censored),
+ log2inty = ifelse(!(is.na(ABUNDANCE) |
+                     if (has_censored) censored else FALSE),
                    ABUNDANCE, NA),
This statement is a little confusing. Can it be simplified? The if statement in the middle of this makes it confusing
ifelse(!is.na(ABUNDANCE) & !(has_censored & censored), ABUNDANCE, NA)
Also it should be converted to a series of ":=" I think
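A quick check that the suggested simplification gives the same result as the expression under review, on hypothetical toy vectors. One assumption is worth stating: `&` evaluates both operands, so the simplified form needs `censored` to exist even when `has_censored` is FALSE, whereas the inline `if` never touches it in that case. Whether that matters depends on the surrounding code, which is not shown here.

```r
# Hypothetical toy inputs standing in for the ABUNDANCE and censored columns.
abundance <- c(1.5, NA, 2.5, 3.5)
censored <- c(FALSE, FALSE, TRUE, FALSE)

results_match <- logical(0)
for (has_censored in c(TRUE, FALSE)) {
  # Expression as written in the PR.
  original <- ifelse(!(is.na(abundance) |
                         if (has_censored) censored else FALSE),
                     abundance, NA)
  # Expression suggested in the review comment.
  simplified <- ifelse(!is.na(abundance) & !(has_censored & censored),
                       abundance, NA)
  results_match <- c(results_match, identical(original, simplified))
}
```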
Why is this needed here?
The "merged" column existed only so it could be glued onto the end of the run name.
Now "merged" is visibly just a suffix in the name
input$RUN = factor(input$RUN, levels=unique(input$RUN), labels=seq(1, length(unique(input$RUN))))
input = input[, !(colnames(input) %in% c('tmp','newRun')), with = FALSE]
}
} else {
Could you add unit tests for the MSstatsMergeFractions function? It seems to be a really confusing part and it'd be helpful to have test cases that verify its original behavior remains unchanged.
One more note: this section of code took me a really long time to figure out what was going on. If possible, could you look to simplify the code so that it's obvious what this function's intent is? I'd timeblock at most 30 minutes for simplifying the code and making the original intent obvious, I don't think this code block is used often by users.
The merged := "merged" column only existed so the next line could append a "merged" suffix to the run name via positional indexing: .SDcols = c(1:3, ncol(match_runs)). The non-obvious part is what those positions actually select. With, say, 4 fractions, match_runs (after dcast) looks like:
pos:  1               2                 3    4    5    6    7
col:  GROUP_ORIGINAL  SUBJECT_ORIGINAL  "1"  "2"  "3"  "4"  merged
So c(1:3, ncol) = positions 1, 2, 3, 7 = GROUP_ORIGINAL, SUBJECT_ORIGINAL, fraction 1 only, and merged. Fractions 2–4 are never used — the merged run is named after just the first fraction's run, even though the code looks like it might use all of them.
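The positional selection described above can be reproduced on a toy table. The input rows, the `RUN` values, and the `paste()` call used to build the name are assumptions made for illustration; only the `.SDcols = c(1:3, ncol(match_runs))` pattern comes from the comment.

```r
library(data.table)

# Hypothetical long-format run annotation: one subject, four fractions.
runs <- data.table(GROUP_ORIGINAL = "G1",
                   SUBJECT_ORIGINAL = "S1",
                   FRACTION = 1:4,
                   RUN = paste0("run", 1:4))

# One column per fraction, holding that fraction's run name.
match_runs <- dcast(runs, GROUP_ORIGINAL + SUBJECT_ORIGINAL ~ FRACTION,
                    value.var = "RUN")
match_runs[, merged := "merged"]

# Columns picked by the positional index: only fraction "1" is used.
picked <- colnames(match_runs)[c(1:3, ncol(match_runs))]

# Assumed way of gluing the picked columns into a run name.
new_run <- match_runs[, do.call(paste, c(.SD, sep = "_")),
                      .SDcols = c(1:3, ncol(match_runs))]
```

Selecting the columns by name instead of position would make it explicit that only the first fraction contributes to the merged run name.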
Also, I focused on the non-TECHREPLICATE branch, which is the part this PR rewrote
summarized = data.table::rbindlist(protein_summaries, fill = TRUE)
rm(protein_summaries)

if (inherits(summarized, "try-error")) {
how does summarized inherit try-error here if there's no try-catch block for line 48?
Yeah, I don't see a try-catch clause in the summarization code. Is there even a useful place for it there?
On a related note: we use withCallingHandlers for AFT call there, isn't it very slow?
fixed it with: if (is.null(summarized))
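For context on the question in this thread, a short sketch of when an object can inherit from "try-error". The toy summaries are hypothetical. An object only carries that class when it is the return value of a failed `try()` call; the result of `data.table::rbindlist()` is a data.table, so the `inherits()` check on it can never be TRUE.

```r
library(data.table)

# A failed call wrapped in try() returns an object of class "try-error".
failed <- try(stop("summarization failed"), silent = TRUE)
failed_is_try_error <- inherits(failed, "try-error")

# Hypothetical per-protein summaries, bound together as in the snippet above.
protein_summaries <- list(
  data.table(Protein = "P1", LogIntensities = 1),
  data.table(Protein = "P2", LogIntensities = 2)
)
summarized <- rbindlist(protein_summaries, fill = TRUE)
summarized_is_try_error <- inherits(summarized, "try-error")
```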
  if (is_labeled_reference) {
-     single_protein[, predicted := ifelse(censored & is_labeled_ref == FALSE, predicted, NA)]
-     single_protein[, newABUNDANCE := ifelse(censored & is_labeled_ref == FALSE, predicted, newABUNDANCE)]
+     single_protein[!(censored & is_labeled_ref == FALSE), predicted := NA]
+     single_protein[(censored) & is_labeled_ref == FALSE,
+                    newABUNDANCE := predicted]
  } else {
-     single_protein[, predicted := ifelse(censored, predicted, NA)]
-     single_protein[, newABUNDANCE := ifelse(censored, predicted, newABUNDANCE)]
+     single_protein[!(censored), predicted := NA]
+     single_protein[(censored), newABUNDANCE := predicted]
  }

  survival = single_protein[, intersect(c(cols, "LABEL", "predicted"), colnames(single_protein)), with = FALSE]
Please
- remove the "==FALSE" comparisons
- check if the line 424 creates a copy that matters
Done — replaced all six is_labeled_ref == FALSE comparisons with !is_labeled_ref (in both MSstatsSummarizeSingleLinear and MSstatsSummarizeSingleTMP).
Checked with address() — single_protein[, keep, with = FALSE] does materialize a copy of the selected columns (they get new addresses, not shared). But it's a per-protein slice (a handful of columns × one protein's rows), not a whole-dataset copy, so the size is negligible relative to the copies this PR targets. It's also a necessary copy: survival is returned separately and, in the no-impute branch, gets survival[, predicted := NA] — sharing storage with single_protein would corrupt it. So I'd leave it as-is, but happy to revisit
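The address() check described above can be sketched as follows, on a hypothetical per-protein slice (the column values and the `keep` vector are illustrative). It compares the storage address of one column before and after the `with = FALSE` subset, then confirms that a later by-reference write to the subset leaves the original untouched.

```r
library(data.table)

# Hypothetical per-protein slice.
single_protein <- data.table(RUN = 1:3,
                             FEATURE = c("f1", "f2", "f3"),
                             predicted = c(1.1, 2.2, 3.3))

keep <- c("RUN", "predicted")
survival <- single_protein[, keep, with = FALSE]

# TRUE would mean the subset shares column storage with the original.
shares_storage <- identical(address(single_protein$predicted),
                            address(survival$predicted))

# By-reference write on the subset, as in the no-impute branch.
survival[, predicted := NA_real_]
```

If the two tables shared storage, the last line would also blank out `single_protein$predicted`, which is why the copy is described as necessary.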
newABUNDANCE := predicted]
}
survival = single_protein[, intersect(c(cols, "LABEL", "predicted"), colnames(single_protein)), with = FALSE]
} else {
Same as above:
- let's remove the "==FALSE" comparisons
- let's see if line 580 creates a copy (of a considerable size, if any)
drop_cols = setdiff(colnames(input), cols)
for (col in drop_cols) data.table::set(input, j = col, value = NULL)

input$PEPTIDE = paste(input$PEPTIDESEQUENCE,
"$" assignments should go. Line 215 is the way to go
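A sketch of what the by-reference alternative to a `$` assignment could look like. The second argument to `paste()` is cut off in the snippet above, so `PRECURSORCHARGE` and the `"_"` separator are assumptions made for illustration. As far as I know, `$<-` on a data.table goes through a method that can copy the table before assigning, which is the cost that `set()` and `:=` avoid.

```r
library(data.table)

# Hypothetical input; PRECURSORCHARGE is an assumed second column.
input <- data.table(PEPTIDESEQUENCE = c("AAK", "CCR"),
                    PRECURSORCHARGE = c(2L, 3L))
addr_before <- address(input)

# Add the column by reference instead of input$PEPTIDE = paste(...).
set(input, j = "PEPTIDE",
    value = paste(input$PEPTIDESEQUENCE, input$PRECURSORCHARGE, sep = "_"))

# The := form does the same thing:
# input[, PEPTIDE := paste(PEPTIDESEQUENCE, PRECURSORCHARGE, sep = "_")]

same_object <- identical(address(input), addr_before)
```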
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@man/dot-finalizeInput.Rd`:
- Around line 7-18: The Rd documentation is out of sync: update the \arguments
block for .finalizeInput to replace the stale item {summarized} with an entry
for {predicted_survival} that describes the new data.table contract;
specifically, ensure the arguments list includes input, predicted_survival,
method, impute, and censored_symbol and that the predicted_survival description
matches the function's expected data.table structure (columns/types) used by
.finalizeInput so \usage and \arguments are consistent with the function
signature.
📒 Files selected for processing (6)
R/utils_feature_selection.R, R/utils_normalize.R, R/utils_output.R, inst/tinytest/test_MSstatsMergeFractions.R, man/dot-finalizeInput.Rd, man/dot-finalizeTMP.Rd
✅ Files skipped from review due to trivial changes (1)
- man/dot-finalizeTMP.Rd
🚧 Files skipped from review as they are similar to previous changes (3)
- R/utils_feature_selection.R
- R/utils_normalize.R
- R/utils_output.R
.finalizeInput(input, predicted_survival, method, impute, censored_symbol)
}
\arguments{
\item{input}{feature-level data}

\item{summarized}{protein-level data (list)}

\item{method}{summary method}

\item{impute}{if TRUE, censored missing values were imputed}

\item{censored_symbol}{censored missing value indicator}

\item{summarized}{protein-level data (list)}
Documented arguments are out of sync with the updated function signature.
Line 7 uses predicted_survival, but Line 18 still documents summarized and there is no predicted_survival argument entry. Please replace the stale summarized item with predicted_survival (and description matching the new data.table contract) so \usage and \arguments are consistent.
For issue: #208 (comment) I looked into this but think the current input[, list(...)] is actually the right call here, because of the following:
Since data.table :=/setnames modify by reference across function boundaries, renaming the columns to lowercase (or adding log2inty/is_obs) inside the function would mutate the caller's table and break that merge (its by.x keys would no longer exist). The input[, list(...)] form avoids this by returning a new table and only rebinding the local name.
Happy to reevaluate my decision if I missed something.
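The by-reference behaviour described in this comment can be shown with two hypothetical helpers (neither is an MSstats function). The first renames a column with `setnames()` inside the function and thereby changes the caller's table; the second uses the `input[, list(...)]` form and leaves the caller's table alone.

```r
library(data.table)

# Hypothetical helper: renames in place, so the caller's table changes too.
rename_in_place <- function(input) {
  setnames(input, "ABUNDANCE", "log2inty")
  input
}

# Hypothetical helper: builds a new table and only rebinds the local name.
project_new_table <- function(input) {
  input = input[, list(log2inty = ABUNDANCE)]
  input
}

caller1 <- data.table(ABUNDANCE = c(1, 2))
out1 <- rename_in_place(caller1)
caller1_names <- colnames(caller1)  # the caller lost its ABUNDANCE column name

caller2 <- data.table(ABUNDANCE = c(1, 2))
out2 <- project_new_table(caller2)
caller2_names <- colnames(caller2)  # the caller is unchanged
```

A later merge keyed on `ABUNDANCE` would still work on `caller2` but not on `caller1`, which matches the failure mode the comment is guarding against.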
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/utils_checks.R`:
- Line 173: The AnomalyScores column is being initialized with plain NA which
creates a logical column; update the call to data.table::set in R/utils_checks.R
(the line that sets j = "AnomalyScores") to use NA_real_ instead of NA so the
column type matches downstream numeric expectations (see
.updateColumnsForProcessing which uses NA_real_ and the uppercasing step that
may leave the column present).
📒 Files selected for processing (3)
R/dataProcess.R, R/utils_checks.R, inst/tinytest/test_memory_optimization_copies.R
🚧 Files skipped from review as they are similar to previous changes (2)
- R/dataProcess.R
- inst/tinytest/test_memory_optimization_copies.R
  if (!"AnomalyScores" %in% colnames(input)){
-     input$AnomalyScores = NA
+     data.table::set(input, j = "AnomalyScores", value = NA)
Use NA_real_ for type consistency with downstream numeric AnomalyScores.
Using plain NA creates a logical-typed column. Since ANOMALYSCORES is expected to be numeric (values like 0.03, 0.01 per tests), and .updateColumnsForProcessing at line 318 uses NA_real_, this should match.
After column names are uppercased at line 198, the check at line 317 will find the column already exists and skip the NA_real_ assignment, leaving the column as logical.
Proposed fix
- data.table::set(input, j = "AnomalyScores", value = NA)
+ data.table::set(input, j = "AnomalyScores", value = NA_real_)
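The type difference behind this suggestion can be checked directly. The two toy tables are hypothetical; the point is only that a column created from plain `NA` is logical, while one created from `NA_real_` is numeric.

```r
library(data.table)

# Column initialized with plain NA.
with_na <- data.table(x = 1:2)
set(with_na, j = "AnomalyScores", value = NA)
class_plain <- class(with_na$AnomalyScores)

# Column initialized with NA_real_.
with_na_real <- data.table(x = 1:2)
set(with_na_real, j = "AnomalyScores", value = NA_real_)
class_real <- class(with_na_real$AnomalyScores)
```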
User description
* Replaced `input[, cols, with = FALSE]` deep-copy in MSstatsPrepareForDataProcess and MSstatsSummarizationOutput with drop-cols loops via data.table::set(j = ..., value = NULL).
* Replaced row-shuffle `input = input[order(...), ]` in .prepareForDataProcess with data.table::setorder() (in place).
* Replaced the synthesised `tmp` string-join filter in MSstatsMergeFractions with a direct (FEATURE, FRACTION) keyed lookup; drops two large character vectors and a paste() call.

See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md
PR Type
Enhancement, Bug fix, Tests
Description
Replace full-table copies with in-place updates
Use keyed lookups for joins
Split survival outputs before finalization
Add memory-regression pipeline tests
File Walkthrough
dataProcess.R (R/dataProcess.R): Limit censored-value updates to matching rows
- `ifelse()` rewrites with targeted `:=` updates
- `predicted` on applicable censored rows
- `newABUNDANCE` where imputation applies

utils_checks.R (R/utils_checks.R): Avoid copies during input trimming and sorting
- `data.table::set(..., NULL)`
- `data.table::setorder()`

utils_feature_selection.R (R/utils_feature_selection.R): Collapse feature preprocessing into one pass
- `censored` values inline
- `is_obs` without intermediate tables

utils_normalize.R (R/utils_normalize.R): Use in-place cleanup and keyed fraction merges
- `merge()` with keyed `newRun` assignment
- `(FEATURE, FRACTION)` lookup

utils_output.R (R/utils_output.R): Streamline summary output and imputation joins
- `predicted_survival` before finalization

test_memory_optimization_copies.R (inst/tinytest/test_memory_optimization_copies.R): Add memory regression tests for copy avoidance
- `.normalizeMedian` temp-column cleanup
- `.finalizeTMP` keyed matches and unmatched `NA`s
- `MSstatsSummarizationOutput` list splitting behavior

Motivation and Context: short summary of the solution
The dataProcess pipeline used several copy-heavy data.table idioms (column-subset deep copies, merge(all.x=TRUE) joins, order-based reassignments, and full-vector ifelse writes) that caused excessive memory allocations. This PR replaces those with in-place data.table operations (data.table::set(..., value = NULL), data.table::setorder(), keyed which lookups + data.table::set, and targeted [i, j := v] updates) to reduce memory churn while preserving behavior. It also extracts predicted_survival earlier in summarization, nulls nested survival slots, fixes two regressions (predicted_survival join-columns intersection and retention of LABEL), updates documentation for changed parameters, and adds memory-regression tinytests. Full test suite passes.
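The "keyed which lookup + data.table::set" pattern named in this summary can be sketched on toy data. The tables, their values, and the restriction to rows that are still NA are assumptions for illustration, not the actual .finalizeTMP logic. The idea is to ask the join only for matching row numbers (`which = TRUE`) and then write the matched values into the existing table by reference, instead of building a merged copy with `merge(all.x = TRUE)`.

```r
library(data.table)

# Hypothetical feature-level table with some missing abundances.
input <- data.table(RUN = c(1L, 1L, 2L, 2L, 3L),
                    FEATURE = c("f1", "f2", "f1", "f2", "f1"),
                    newABUNDANCE = c(10, NA, NA, 40, NA))

# Hypothetical predictions covering only some (RUN, FEATURE) pairs.
predicted_survival <- data.table(RUN = c(1L, 2L),
                                 FEATURE = c("f2", "f1"),
                                 predicted = c(12.5, 13.5))

# For each row of input: the matching row of predicted_survival, or NA.
idx <- predicted_survival[input, on = c("RUN", "FEATURE"), which = TRUE]

# Write predictions into the rows that matched and are still missing.
rows <- which(!is.na(idx) & is.na(input$newABUNDANCE))
set(input, i = rows, j = "newABUNDANCE",
    value = predicted_survival$predicted[idx[rows]])
```

The unmatched row (RUN 3) stays NA, mirroring the "unmatched NAs" case that the tinytests in this PR are described as covering. This sketch assumes each (RUN, FEATURE) pair appears at most once in `predicted_survival`.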
Detailed changes
General
R/dataProcess.R
R/utils_checks.R
R/utils_feature_selection.R
R/utils_normalize.R
R/utils_output.R
Regressions fixed
Documentation
Unit tests added / modified
Coding guidelines / potential violations