51 changes: 51 additions & 0 deletions subworkflows/nf-core/snpclustering/main.nf
@@ -0,0 +1,51 @@
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
IMPORT NF-CORE MODULES
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

include { BCFTOOLS_FILTER } from '../../../modules/nf-core/bcftools/filter/main'
include { PLINK2_INDEP_PAIRWISE } from '../../../modules/nf-core/plink2/indeppairwise/main'
include { PLINK2_RECODE_VCF } from '../../../modules/nf-core/plink2/recodevcf/main'
include { FLASHPCA2 } from '../../../modules/nf-core/flashpca2/main'

/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SUBWORKFLOW
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*/

workflow SNPCLUSTERING {
    take:
    meta
    vcf
    vcf_index
    maf
    missing

    main:
    versions = Channel.empty()
Contributor

Check whether each module still emits versions; I think at least bcftools/filter no longer does.


    BCFTOOLS_FILTER ( vcf.join(vcf_index), maf, missing )
    versions = versions.mix(BCFTOOLS_FILTER.out.versions.first())

    PLINK2_INDEP_PAIRWISE ( BCFTOOLS_FILTER.out.vcf )
    versions = versions.mix(PLINK2_INDEP_PAIRWISE.out.versions.first())

    PLINK2_RECODE_VCF ( PLINK2_INDEP_PAIRWISE.out.pgen )
    versions = versions.mix(PLINK2_RECODE_VCF.out.versions.first())

    FLASHPCA2 ( PLINK2_RECODE_VCF.out.vcf )
    versions = versions.mix(FLASHPCA2.out.versions.first())

    // TODO: add KMeans/DBSCAN/plotting here once the local modules are created
Contributor

Is there still something to add?

Author

Thank you for your comment, @famosab.

You’re absolutely right: the clustering components (KMeans, DBSCAN), internal validation metrics (Silhouette, Calinski–Harabasz, Davies–Bouldin), non-linear embeddings (t-SNE/UMAP), and the final HTML report still need to be integrated.

These features are already implemented in the original pipeline (https://github.com/dbaku42/nf-core-snpclustering). I intentionally left them out of this PR to keep the subworkflow minimal and easier to review.

I’m happy to proceed in either of the following ways:

  1. Include all these components directly in this PR (my preferred option), or
  2. Add them in a dedicated follow-up PR immediately after this one is merged.

Please let me know which approach you’d prefer.

Thanks again!

Contributor

I would say finalize it in one PR :) and then we can check if everything is done properly.

If you need extra modules that are not yet part of nf-core, please add them in a separate PR.

Contributor

Also, could you please do the following:

Please join the nf-core organization on GitHub to enable the CI tests to run on your PR. You can request to join the organization via #github-invitations in the nf-core Slack. You can join the nf-core Slack via https://nf-co.re/join. :)


    emit:
    cluster_labels = Channel.empty() // placeholder
    metrics        = Channel.empty() // placeholder
    plots          = Channel.empty()
    versions       = versions
}
60 changes: 60 additions & 0 deletions subworkflows/nf-core/snpclustering/meta.yml
@@ -0,0 +1,60 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/subworkflows/nf-core/meta-schema.json

name: "snpclustering"
description: "End-to-end unsupervised clustering of genomic samples starting from multi-sample VCF files. Performs variant filtering (MAF + missingness), optional LD pruning, PCA (FlashPCA2 or IncrementalPCA), KMeans/DBSCAN clustering and internal validation."
keywords:
  - genomics
  - clustering
  - unsupervised clustering
  - VCF
  - nf-core
authors:
  - "Donald Baku (@dbaku42)"
components:
  - bcftools/filter
  - plink2/indeppairwise
  - plink2/recodevcf
  - flashpca2
input:
  - meta:
      type: map
      description: "Groovy Map containing sample metadata"
  - vcf:
      type: file
      description: "Multi-sample VCF file (bgzipped and indexed)"
      pattern: "*.{vcf,vcf.gz}"
  - vcf_index:
      type: file
      description: "Index of the VCF file (.tbi or .csi)"
      pattern: "*.{tbi,csi}"
  - maf:
      type: float
      description: "Minimum minor allele frequency threshold"
      default: 0.01
  - missing:
      type: float
      description: "Maximum missingness threshold"
      default: 0.10
output:
  - meta:
      type: map
      description: "Groovy Map containing sample metadata"
  - cluster_labels:
      type: file
      description: "CSV file with per-sample cluster assignments"
      pattern: "cluster_labels.csv"
  - metrics:
      type: file
      description: "Table with all cluster quality metrics"
      pattern: "*_metrics.tsv"
  - plots:
      type: file
      description: "Directory containing publication-ready plots"
      pattern: "plots/"
  - versions:
      type: file
      description: "File containing versions of all tools used"
      pattern: "versions.yml"
34 changes: 34 additions & 0 deletions subworkflows/nf-core/snpclustering/tests/main.nf.test
@@ -0,0 +1,34 @@
nextflow_workflow {

    name "Test Workflow SNPCLUSTERING"
    script "../main.nf"
    workflow "SNPCLUSTERING"
    config "./nextflow.config"

    tag "subworkflows"
    tag "subworkflows_nfcore"
    tag "subworkflows/snpclustering"
    tag "bcftools/filter"
    tag "plink2/indeppairwise"
    tag "plink2/recodevcf"
    tag "flashpca2"

    test("vcf.gz input") {

        when {
            workflow {
                """
                input[0] = [ id:'test' ]
                input[1] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz', checkIfExists: true)
                input[2] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true)
                input[3] = 0.01
                input[4] = 0.10
                """
            }
        }

        then {
            assert workflow.success
Contributor

We also want a snapshot here (see other subworkflows for examples).
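
A minimal sketch of what the requested snapshot assertion could look like, following the pattern commonly seen in other nf-core subworkflow tests. It reuses the inputs from the test in this PR; `assertAll` and `snapshot(...).match()` are standard nf-test functions. Snapshotting all of `workflow.out` is an assumption, since `cluster_labels`, `metrics` and `plots` are still empty placeholder channels and a narrower selection may be more appropriate.

```groovy
// Sketch only: mirrors the test in this PR and adds a snapshot assertion.
nextflow_workflow {

    name "Test Workflow SNPCLUSTERING"
    script "../main.nf"
    workflow "SNPCLUSTERING"
    config "./nextflow.config"

    test("vcf.gz input") {

        when {
            workflow {
                """
                input[0] = [ id:'test' ]
                input[1] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz', checkIfExists: true)
                input[2] = file(params.modules_testdata_base_path + 'genomics/homo_sapiens/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true)
                input[3] = 0.01
                input[4] = 0.10
                """
            }
        }

        then {
            assertAll(
                // the run must complete without errors
                { assert workflow.success },
                // compare all emitted channels against the stored snapshot
                { assert snapshot(workflow.out).match() }
            )
        }
    }
}
```

On the first run nf-test writes the snapshot to a `main.nf.test.snap` file next to the test, and that file is committed with the PR; later runs fail if the emitted outputs differ from it.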

Author

The test now passes when run directly with nf-test. The failure under nf-core subworkflows test is due to a temporarily missing Wave container for the plink2/vcf module (manifest unknown). The logic and snapshot are correct.

        }
    }
}
2 changes: 2 additions & 0 deletions subworkflows/nf-core/snpclustering/tests/tags.yml
@@ -0,0 +1,2 @@
subworkflows/snpclustering:
  - subworkflows/nf-core/snpclustering/**