refactor: uniformize somatic-variant definition across bin scripts #469
m-huertasp wants to merge 9 commits into
- style: black-format filter cohort and check contamination
FerriolCalvet
left a comment
Looks good. There is only this one comment on the check_contamination script: if you can address it to increase clarity, that would be great. If you think it is clear enough already, I am happy to accept that as well.
I think there is something that pisses me off a bit about this script: the sequence of plots in the code is a bit confusing.
Could you give it a try and split the main function into different ones? I can give it a try as well otherwise.
contamination_vaf_threshold = 0.05
somatic_snp_positions_maf = snp_positions_maf.loc[
    somatic_mask(snp_positions_maf, contamination_vaf_threshold)
].reset_index(drop=True)
germline_snp_positions_maf = snp_positions_maf.loc[
    germline_mask(snp_positions_maf, contamination_vaf_threshold)
].reset_index(drop=True)
I am not sure that this has the same behaviour as before, since there might be some mutations that fall in between the two.
maybe inverting the somatic_mask would be the best way to go?
It does not have the same behaviour as before because we are using the three VAFs, so we can lose mutations that have VAF > 0.05 but vd_VAF < 0.05. Would it be best to use the previous strategy in this case?
I am not sure that it makes a big difference, but probably yes, it is better to follow the same strategy as before.
I think that my suggestion of using the somatic_mask and then "inverting" the selection should work well and would not be too difficult to apply.
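To make the gap discussed in this thread concrete, here is a minimal sketch. The bodies of somatic_mask and germline_mask below are assumptions (all three columns at or below the threshold for somatic, all three above it for germline), not the actual helpers in bin/utils_filter.py, and the data frame is made up for illustration. The third row has VAF > 0.05 but vd_VAF < 0.05, so neither mask selects it, while inverting the somatic mask assigns it to the germline side.

```python
import pandas as pd

# Column names taken from the PR description.
VAF_COLUMNS = ["VAF", "vd_VAF", "VAF_AM"]


def somatic_mask(maf: pd.DataFrame, threshold: float) -> pd.Series:
    """Assumed rule: somatic when all three VAF columns are <= threshold."""
    return (maf[VAF_COLUMNS] <= threshold).all(axis=1)


def germline_mask(maf: pd.DataFrame, threshold: float) -> pd.Series:
    """Assumed rule: germline when all three VAF columns are > threshold."""
    return (maf[VAF_COLUMNS] > threshold).all(axis=1)


# Made-up example data; the third row disagrees across columns.
maf = pd.DataFrame(
    {
        "VAF": [0.01, 0.40, 0.08],
        "vd_VAF": [0.02, 0.45, 0.03],
        "VAF_AM": [0.01, 0.42, 0.06],
    }
)
threshold = 0.05

somatic = somatic_mask(maf, threshold)
germline = germline_mask(maf, threshold)

# Rows matched by neither helper are silently dropped by the two-mask split.
dropped = maf.loc[~(somatic | germline)]

# Inverting the somatic mask puts every row on exactly one side.
somatic_rows = maf.loc[somatic].reset_index(drop=True)
germline_rows = maf.loc[~somatic].reset_index(drop=True)
```

With the inverted mask the two subsets always add up to the full table, which is the property the two independent masks do not guarantee under these assumed rules.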
Still to do
For when I come back, or if someone does this.
1. Address Ferriol's comment
Ferriol flagged that the new
# Current code (~line 459):
somatic_snp_positions_maf = snp_positions_maf.loc[somatic_mask(snp_positions_maf, contamination_vaf_threshold)]
germline_snp_positions_maf = snp_positions_maf.loc[germline_mask(snp_positions_maf, contamination_vaf_threshold)]
Either go back to previous strategy or invert somatic_mask for the germline side, something like:
somatic_snp_mask = somatic_mask(snp_positions_maf, contamination_vaf_threshold)
somatic_snp_positions_maf = snp_positions_maf.loc[somatic_snp_mask].reset_index(drop=True)
germline_snp_positions_maf = snp_positions_maf.loc[~somatic_snp_mask].reset_index(drop=True)
2. Refactor plot structure
Plan for doing this:
Small PR.
What this does
Establishes a single definition of a somatic variant and applies it consistently across the bin/ scripts. Previously the somatic/germline split was re-implemented inline in several places with divergent forms (single-column VAF in filter_cohort.py, three-column elsewhere, plus hardcoded thresholds in check_contamination.py), which made the somatic and germline sets non-symmetric.
The definition now lives in two shared helpers and every call site uses them. Closes #418.
Changes
- Add somatic_mask/germline_mask to bin/utils_filter.py: the canonical all-3-column rule (VAF, vd_VAF, VAF_AM vs a caller-supplied threshold), with NumPy docstrings.
- Update bin/filter_cohort.py to use the helpers (somatic flagging goes single-column → all-3-column).
- bin/check_contamination.py: germline/SNP predicates use the helpers; the hardcoded 0.2 threshold is replaced by a new --somatic-vaf-boundary CLI option fed from params.germline_threshold (wired via modules/local/contamination/main.nf + conf/modules.config); NumPy docstrings added to all public functions.
- bin/test/test_utils_filter.py: 46 unit tests covering every public function in utils_filter.py.
What to review
filter_cohort somatic flagging is now all-3-column; the between-samples contamination threshold moves 0.2 → 0.3 (default germline_threshold).
Testing
pytest bin/test/test_utils_filter.py → 46 passed + ruff clean on all touched files.
The full bin/test/ suite has 3 pre-existing failures + 1 collection error in unrelated files (test_plot_selectionsideplots.py, test_check_samplesheet.py) that are untouched by this PR.
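For reviewers who want a picture of the --somatic-vaf-boundary change listed under Changes, the following is a rough sketch of replacing a hardcoded threshold with a CLI option. It assumes argparse; the real script may use a different CLI library, and build_parser, parse_boundary, the required flag, and the range check are all hypothetical names and choices made for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the VAF boundary as an option (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Contamination check (sketch)")
    parser.add_argument(
        "--somatic-vaf-boundary",
        type=float,
        required=True,  # assumption: the pipeline always supplies the value
        help="VAF boundary separating somatic from germline variants",
    )
    return parser


def parse_boundary(argv: list[str]) -> float:
    """Parse argv and check that the boundary is a usable fraction."""
    args = build_parser().parse_args(argv)
    boundary = args.somatic_vaf_boundary
    if not 0.0 < boundary < 1.0:
        raise ValueError(
            f"--somatic-vaf-boundary must be in (0, 1), got {boundary}"
        )
    return boundary


# Roughly what the Nextflow module would pass when germline_threshold is 0.3.
boundary = parse_boundary(["--somatic-vaf-boundary", "0.3"])
```

Keeping the value as a required argument in this sketch means the threshold has a single source (params.germline_threshold on the pipeline side) rather than a second default hidden in the script.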