Add snpclustering subworkflow #11059
base: master
Changes from 2 commits
6af3ab8
dea109e
8aac674
7f27d21
dabc976
81b57d1
b796acd
09473bc
b35034a
d6fdd58
c56e54c
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| #!/usr/bin/env nextflow | ||
| nextflow.enable.dsl = 2 | ||
| | ||
| /* | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| IMPORT NF-CORE MODULES | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| */ | ||
| | ||
| include { BCFTOOLS_FILTER } from '../../../modules/nf-core/bcftools/filter/main' | ||
| include { PLINK2_INDEP_PAIRWISE } from '../../../modules/nf-core/plink2/indeppairwise/main' | ||
| include { PLINK2_RECODE_VCF } from '../../../modules/nf-core/plink2/recodevcf/main' | ||
| include { FLASHPCA2 } from '../../../modules/nf-core/flashpca2/main' | ||
| | ||
| /* | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| SUBWORKFLOW | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| */ | ||
| | ||
| workflow SNPCLUSTERING { | ||
| take: | ||
| meta | ||
| vcf | ||
| vcf_index | ||
| maf | ||
| missing | ||
| | ||
| main: | ||
| versions = Channel.empty() | ||
| | ||
| BCFTOOLS_FILTER ( vcf.join(vcf_index), maf, missing ) | ||
| versions = versions.mix(BCFTOOLS_FILTER.out.versions.first()) | ||
| | ||
| PLINK2_INDEP_PAIRWISE ( BCFTOOLS_FILTER.out.vcf ) | ||
| versions = versions.mix(PLINK2_INDEP_PAIRWISE.out.versions.first()) | ||
| | ||
| PLINK2_RECODE_VCF ( PLINK2_INDEP_PAIRWISE.out.pgen ) | ||
| versions = versions.mix(PLINK2_RECODE_VCF.out.versions.first()) | ||
| | ||
| FLASHPCA2 ( PLINK2_RECODE_VCF.out.vcf ) | ||
| versions = versions.mix(FLASHPCA2.out.versions.first()) | ||
| | ||
| // TODO: add KMeans/DBSCAN/plotting here once the local modules are created | ||
Contributor
Is there still something to add?
Author
Thank you for your comment, @famosab. You're absolutely right: the clustering components (KMeans, DBSCAN), internal validation metrics (Silhouette, Calinski–Harabasz, Davies–Bouldin), non-linear embeddings (t-SNE/UMAP), and the final HTML report still need to be integrated. These features are already implemented in the original pipeline (https://github.com/dbaku42/nf-core-snpclustering). I intentionally left them out of this PR to keep the subworkflow minimal and easier to review. I'm happy to proceed in either of the following ways:
Please let me know which approach you'd prefer. Thanks again!
Contributor
I would say finalize it in one PR :) and then we can check whether everything is done properly. If you need extra modules that are not part of nf-core yet, please add them in a separate PR.
Contributor
Also, could you please join the nf-core organization on GitHub to enable the CI tests to run on your PR? You can request to join the organization via #github-invitations in the nf-core Slack. You can join the nf-core Slack via https://nf-co.re/join. :)
| | ||
| emit: | ||
| cluster_labels = Channel.empty() // placeholder | ||
| metrics = Channel.empty() // placeholder | ||
| plots = Channel.empty() | ||
| versions = versions | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,60 @@ | ||
| --- | ||
| # yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/nf-core/meta-schema.json | ||
| | ||
| name: "snpclustering" | ||
| description: "End-to-end unsupervised clustering of genomic samples starting from multi-sample VCF files. Performs variant filtering (MAF + missingness), optional LD pruning, PCA (FlashPCA2 or IncrementalPCA), KMeans/DBSCAN clustering and internal validation." | ||
| keywords: | ||
| - genomics | ||
| - clustering | ||
| - unsupervised clustering | ||
| - VCF | ||
| - nf-core | ||
| authors: | ||
| - "Donald Baku (@dbaku42)" | ||
| components: | ||
| - bcftools/filter | ||
| - plink2/indep/pairwise | ||
| - plink2/recode/vcf | ||
| - plink2/indeppairwise | ||
| - plink2/recodevcf | ||
| - flashpca2 | ||
| input: | ||
| - meta: | ||
| type: map | ||
| description: "Groovy Map containing sample metadata" | ||
| - vcf: | ||
| type: file | ||
| description: "Multi-sample VCF file (bgzipped and indexed)" | ||
| pattern: "*.{vcf,vcf.gz}" | ||
| - vcf_index: | ||
| type: file | ||
| description: "Index of the VCF file (.tbi or .csi)" | ||
| pattern: "*.{tbi,csi}" | ||
| - maf: | ||
| type: float | ||
| description: "Minimum minor allele frequency threshold" | ||
| default: 0.01 | ||
| - missing: | ||
| type: float | ||
| description: "Maximum missingness threshold" | ||
| default: 0.10 | ||
| output: | ||
| - meta: | ||
| type: map | ||
| description: "Groovy Map containing sample metadata" | ||
| - cluster_labels: | ||
| type: file | ||
| description: "CSV file with per-sample cluster assignments" | ||
| pattern: "cluster_labels.csv" | ||
| - metrics: | ||
| type: file | ||
| description: "Table with all cluster quality metrics" | ||
| pattern: "*_metrics.tsv" | ||
| - plots: | ||
| type: file | ||
| description: "Directory containing publication-ready plots" | ||
| pattern: "plots/" | ||
| - versions: | ||
| type: file | ||
| description: "File containing versions of all tools used" | ||
| pattern: "versions.yml" | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| nextflow_workflow { | ||
| | ||
| name "Test Workflow SNPCLUSTERING" | ||
| script "../main.nf" | ||
| workflow "SNPCLUSTERING" | ||
| config "./nextflow.config" | ||
| | ||
| tag "subworkflows" | ||
| tag "subworkflows_nfcore" | ||
| tag "subworkflows/snpclustering" | ||
| tag "bcftools/filter" | ||
| tag "plink2/indeppairwise" | ||
| tag "plink2/recodevcf" | ||
| tag "flashpca2" | ||
| | ||
| test("vcf.gz input") { | ||
| | ||
| when { | ||
| workflow { | ||
| """ | ||
| input[0] = [ id:'test' ] | ||
| input[1] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz', checkIfExists: true) | ||
| input[2] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true) | ||
| input[3] = 0.01 | ||
| input[4] = 0.10 | ||
| """ | ||
| } | ||
| } | ||
| | ||
| then { | ||
| assert workflow.success | ||
Contributor
We also want a snapshot here (look at other subworkflows).
Author
The test now passes when run directly with nf-test. The failure with nf-core subworkflows test is due to a temporarily missing Wave container for the plink2/vcf module (manifest unknown). The logic and snapshot are correct.
| } | ||
| } | ||
| } | ||
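The snapshot requested in the review is typically added to the `then` block of the test. The following is a minimal sketch, not the code from this PR: it assumes the standard nf-test `assertAll` helper and `snapshot(...).match()` assertion, and it assumes that every channel emitted by the subworkflow is stable enough to snapshot as a whole.

```groovy
// Hypothetical replacement for the `then` block of the "vcf.gz input" test.
then {
    assertAll(
        // Fail early if the subworkflow itself did not complete.
        { assert workflow.success },
        // Compare all emitted channels against the stored snapshot file;
        // nf-test creates the snapshot on the first run.
        { assert snapshot(workflow.out).match() }
    )
}
```

If some outputs turn out not to be reproducible between runs, a narrower snapshot of selected channels (for example only `workflow.out.versions`) would be the more conservative choice.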
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| subworkflows/snpclustering: | ||
| - subworkflows/nf-core/snpclustering/** |
Check for each module whether it still exports versions; I think at least bcftools/filter no longer does.