Pedigree phasing with whatshap (issue 387)#968
Pedigree phasing with whatshap (issue 387)#968tetedange13 wants to merge 41 commits intogenomic-medicine-sweden:devfrom
Conversation
Default to 'none' for now
Otherwise wrong PED could be used when multiple families
Simplify tests update
Should not be impacted by change in whatshap
Separate commit because parts of snapshot changes seem to come from other unrelated commits (eg: PR 948 ?)
PR checklist
|
fellen31
left a comment
There was a problem hiding this comment.
Thanks a lot for this contribution!
- No test in
subworkflows/whatshapto capture this new behaviour ? Would require 2 trios, or best 2 trios + 1 solo ? Is there enough test data on genomic-medicine-sweden/test-datasets for that ?
New tests would be great! And I think so! Since whatshap phases one family at a time, we should be able to reuse the input files but with different family/sample IDs. So you could probably use the PacBio HG002, HG003 and HG004 files twice for trios, and then pick one of the files for an extra solo.
- A new test sampleSheet with 2 trios + 1 solo could be a good add too ?
I think if you can add 2 trios + 1 solo in the subworkflow test, we should be fine with the current tests. The only logic changed in nallo.nf is whether or not we pass the PED-file right?
I was also wondering if we should have a parameter controlling whether or not PED phasing is applied for trios...but I guess there's no downside to it, so maybe that's just unnecessary?
Would be interesting to hear if phasing a trio with whatshap ped is better than e.g. longphase. Have you tried?
There was a problem hiding this comment.
This update would be great otherwise have in the nf-core module. If you make a PR to nf-core/modules with these updates you can ping me for a review in the nf-core Slack. Should also take care of issues with the linting.
There was a problem hiding this comment.
Latest version of whatshap/phase module changed input structure : https://github.com/nf-core/modules/pull/11041/changes#diff-04d6148de5293b82d05a67c61d9995e47c020e76e0f16f950dac131382e62e62
But I am on it !
There was a problem hiding this comment.
PR to add pedigree input to nf-core/whatshap/phase : nf-core/modules#11128
EDIT : Will probably take me longer than expected to adress this PR, so in the meantime I updated whatshap/phase (with new inputs structure) and re-patched it with ped input
workflows/nallo.nf
Outdated
| .map { meta, _files -> [ [ id: meta.family_id ], meta ] } | ||
| .groupTuple() | ||
| ) | ||
| // If 'childWithTwoParents==false', set family_ped=empty |
There was a problem hiding this comment.
Nice, but could you explain why we need to do this? Will whatshap fail if you give it a PED with only two individuals in a family?
| // If 'childWithTwoParents==false', set family_ped=empty | |
| // If childWithTwoParents is false we don't provide the ped file, because ... |
There was a problem hiding this comment.
Based on Whatshap documentation, pedigree phasing activates a distinct algorithm
So I enforced it to "full trio", in one hand to not risk worse results on cases like "duos" or even "solos". And in the other hand to reserve pedigree phasing for when it is most relevant (= when we have both parents I guess)
But I did actually made a benchmark of all that, to be honest
There was a problem hiding this comment.
Then I would write something like "If we provide a PED file, whatshap phase will automatically activate the pedigree phasing algorithm which is only available for trios..", to make it more clear.
However, now that you link the documentation it seems like it also supports quartets (which is fine - childWithTwoParents will be true for both trios and quartets), but also duos (parent/childs):
A quartet (note how multiple consecutive spaces are fine):
When phasing multiple samples from individuals that are related (such as parent/child or a trio), then it is possible to provide WhatsHap with a .ped file that describes the pedigree. WhatsHap will use the pedigree and the reads to infer a combined, much better phasing.
Based on this line in the WhatsHap code: https://github.com/whatshap/whatshap/blob/f07423a7ba0c7609e7d4d5c73fb5e4e240057830/whatshap/cli/phase.py#L580, it seems to me like it would just ignore the PED file for singletons. So maybe we could always provide the PED file?
If we are want to be able to control the algorithm when we have parent/child, perhaps we could have a pipeline parameter, e.g. whatshap_pedigree_phasing where the user can determine if we provide the PED file or not? (edit: That would default to true)
There was a problem hiding this comment.
Commit 7930be2 fix an important typo between param_name and val_param_name...
Pass PED file in one of them, to test new 'whatshap pedigree phasing modifié : subworkflows/local/whatshap/tests/main.nf.test modifié : subworkflows/local/whatshap/tests/main.nf.test.snap
569b2a8 to
6512c02
Compare
This reverts commit 6512c02.
Pass PED file in one of them, to test new 'whatshap pedigree phasing
+ Forgotten debug 'view()'
| input[5] = channel.of([ | ||
| [ id:'FAM' ], // Empty ch_pedigree | ||
| [] | ||
| ]) |
There was a problem hiding this comment.
Could we try inputting the PED-file here as well?
There was a problem hiding this comment.
Smart idea testing that !
- It works, but snapshots do not match anymore
- However I think this a false positive, simply due to use of simple
snapshot()instead of dedicated nft-vcf plugin'svariantsMD5()(by manual check VCFs seem identical, except for add of--ped FAM.pedin header)
Plot twist is that looking at logs when inputting PED :
- I see
Using uniform recombination rate of 1.26 cM/Mb.(absent when no PED), probably due to https://github.com/whatshap/whatshap/blob/f07423a7ba0c7609e7d4d5c73fb5e4e240057830/whatshap/cli/phase.py#L440, wherelen(family)is not checked - But at the same time
solving MEC problem(contrary toPedMEC problemwhen inputting PED with trio)
So I guess that PED is well ignored when given on a solo
There was a problem hiding this comment.
I also tested on a bigger dataset (100 K variants) and whatshap phase still produces identical results on a solo with or without --ped option --> Confirm we should be fine, PED is well ignored with solo
(even I cannot exactly pinpoint this behaviour in whatshap phase's code)
Do we keep it like that or do you prefer to play safe and get back to initial "phasing for complete trios ONLY" option ?
| val_phaser, | ||
| !val_skip_sv_calling, | ||
| cram_output, | ||
| ch_ped_family |
There was a problem hiding this comment.
- I think we could always pass the ped file here (unless we do not want pedigree phasing), even though it will only be used by whatshap.
- I prefer always passing the meta along with the file, the empty equivalent would be
[[],[]] - We recently moved all params from this workflow to main.nf, since this will be required in future versions of Nextflow. Could you make a
val_whatshap_pedigree_phasinginput tonallo.nf?
| ch_ped_family | |
| val_whatshap_pedigree_phasing ? SOMALIER_PED_FAMILY.out.ped : [[],[]] |
There was a problem hiding this comment.
Commit 8a7b339 was abusive : you have to use an intermediate variable with a correct meta, otherwise --whatshap_pedigree_phasing false raises a join mismatch error there : https://github.com/tetedange13/nallo/blob/7930be21c4c7a13a61a5232f1586eae4c7153307/subworkflows/local/whatshap/main.nf#L27
Fixed by commit 0a5a6d2
Merge conflicts each time..
Even if only applied when '--phaser whatshap'
+ put it inside 'if' block
|
I'll be away for a few weeks but I'll check in again when I get back, unless someone else can continue the review @tetedange13! |
To start working with new input structure
Description
Should adress issue #387
Changed
whatshap+ full family (proband + both parents), now use pedigree information to phaseMain steps
nallo.nf, applyaddChildWithTwoParentsToMetafunction on "by family pedigree" channel and set pedigree_file to "empty" if family not completepedigreetuple throughphasingthenwhatshapsubworkflowssubworkflows/whatshap, use same approach to sync PED with BAM+VCF (otherwise PED files could be swapped between 2 different families)nf-core/whatshap/phasemodule, that I patched to add a newch_pedigreeinputWarnings / limitations
subworkflows/whatshapto capture this new behaviour ? Would require 2 trios, or best 2 trios + 1 solo ? Is there enough test data on genomic-medicine-sweden/test-datasets for that ?nf-core modules lint whatshap/phasenot passing anymore, even if I usednf-core modules patch. Complains aboutmeta.ymlmissing newpedigreeinput, even if it is well present...Thanks for taking a look !
Best,
Felix.