Skip to content

Masurca#11049

Open
LiaOb21 wants to merge 29 commits intonf-core:masterfrom
LiaOb21:masurca
Open

Masurca#11049
LiaOb21 wants to merge 29 commits intonf-core:masterfrom
LiaOb21:masurca

Conversation

@LiaOb21
Copy link
Copy Markdown
Contributor

@LiaOb21 LiaOb21 commented Mar 25, 2026

Masurca module [draft]

Supports assembly with Cabog assembly only. Flye and SOAPdenovo are not supported.

Supported reads:

  • single end or merged reads
  • paired end only
  • paired end + mate pair (jump)
  • paired end + nanopore
  • paired end + pacbio
  • paired end + nanopore + pacbio

Does not support other kinds of reads (Sanger, 454, etc) and synteny guided assembly with reference genome

PR checklist

Closes #10945

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions - See version_topics
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@chriswyatt1
Copy link
Copy Markdown
Contributor

Are the value here pipeline wide, or specific to each genome being run? Is that why they need to be specific channel inpus?
val fragment_mean
val fragment_stdev
val jump_mean
val jump_stdev

@chriswyatt1
Copy link
Copy Markdown
Contributor

path(other_reads) doesn't appear to be used, so could be removed

@chriswyatt1
Copy link
Copy Markdown
Contributor

chriswyatt1 commented Mar 26, 2026

Looks like a specific build type, so I think seqera containers won't be able to handle that.
https://github.com/alekseyzimin/masurca?tab=readme-ov-file#compileinstall-requirements
We will need to build this manually I think. I can have a go

Then we will need to do this:
https://nf-co.re/docs/contributing/guidelines/recommendations/custom_containers

@LiaOb21
Copy link
Copy Markdown
Contributor Author

LiaOb21 commented Mar 26, 2026

Are the value here pipeline wide, or specific to each genome being run? Is that why they need to be specific channel inpus? val fragment_mean val fragment_stdev val jump_mean val jump_stdev

I think they are meant to be as realistic as possible for specific genomes. However, I ran all the tests locally and manually in a conda env using the same nf-core datasets I'm using in main.test.nf and using the same values, and they all gave me the expected outputs.

path(other_reads) doesn't appear to be used, so could be removed

Yes, I was initially planning to integrate that as well, but then I decided to leave it. I forgot to remove it from the input tuple.

Looks like a specific build type, so I think seqera containers won't be able to handle that. https://github.com/alekseyzimin/masurca?tab=readme-ov-file#compileinstall-requirements We will need to build this manually I think. I can have a go

Thank you @chriswyatt1! 😊

@chriswyatt1
Copy link
Copy Markdown
Contributor

OK, Just thinking to make this module more atomic, we should have these values as ext args. But if its helpful, to have this in a channel, I think it can be justified. If we think that any user would need to parse this information in for multiple genomes always, rather than set them generally in the modules.config say

@chriswyatt1
Copy link
Copy Markdown
Contributor

The only other thing of general comments, is about all the echo statements. Is this essential, or maybe we could move some of it into a script maybe. I will check out what other modules do in this case

@LiaOb21
Copy link
Copy Markdown
Contributor Author

LiaOb21 commented Mar 26, 2026

OK, Just thinking to make this module more atomic, we should have these values as ext args. But if its helpful, to have this in a channel, I think it can be justified. If we think that any user would need to parse this information in for multiple genomes always, rather than set them generally in the modules.config say

I think the ideal use would be to have a specific value per genome. Personally I'm using an average fixed value, as I often don't have information about the fragment size. Also from masurca documentation is not completely clear if they refer to the fragment size or the insert size. I don't know what would be more appropriate for the nf-core module, so I'll follow your suggestion 😊

The only other thing of general comments, is about all the echo statements. Is this essential, or maybe we could move some of it into a script maybe. I will check out what other modules do in this case

Yes, so that's the main reason why I was so confused on how setting up this module. I asked in the community and they suggested to have a look to these two modules:

controlfreec also uses echo commands and I thought that was probably the easiest way to build the config for masurca. But happy to change if there is any better way to deal with this! 😊

@chriswyatt1
Copy link
Copy Markdown
Contributor

I made a container, not sure it will work, but worth a try:

quay.io/ecoflowucl/masurca so in the module you can just put ecoflowucl/masurca

test with docker pull quay.io/ecoflowucl/masurca:v4.1.4

Dockerfile is here: https://github.com/Eco-Flow/docker-build/tree/main/masurca

@chriswyatt1
Copy link
Copy Markdown
Contributor

I think the container works. But yes, there are some output files that have time stamps and specific work dir paths, which will change on each run. So we need to make the nf-test more specific. Pick files that won't change
assemble.sh
#!/bin/bash

assemble.sh generated by masurca:

CONFIG_PATH="/Users/cwyatt/Downloads/modules-2/.nf-test/tests/aad480bb9b0561d1bd4bc958001b0e93/work/e4/920dca1167539f25218df125acaf84/test_masurca_config.txt"

test-masurca.log:
[Thu Mar 26 18:02:31 UTC 2026] Processing pe library reads
[Thu Mar 26 18:02:32 UTC 2026] Average PE read length 100
[Thu Mar 26 18:02:32 UTC 2026] Using kmer size of 67 for the graph
[Thu Mar 26 18:02:32 UTC 2026] MIN_Q_CHAR: 33
[Thu Mar 26 18:02:32 UTC 2026] Creating mer database for Quorum

@chriswyatt1
Copy link
Copy Markdown
Contributor

@LiaOb21 did you manage to test the container? I think it worked,,, but let me know if not, and I can look into it

@LiaOb21
Copy link
Copy Markdown
Contributor Author

LiaOb21 commented Apr 1, 2026

@LiaOb21 did you manage to test the container? I think it worked,,, but let me know if not, and I can look into it

Hey @chriswyatt1 the container works! 😊 thank you so much!

Looks like a specific build type, so I think seqera containers won't be able to handle that. https://github.com/alekseyzimin/masurca?tab=readme-ov-file#compileinstall-requirements We will need to build this manually I think. I can have a go

Then we will need to do this: https://nf-co.re/docs/contributing/guidelines/recommendations/custom_containers

We can proceed with this, in the meantime I'll try to set up the tests in a way that they can ignore the timestamps and so on

@LiaOb21
Copy link
Copy Markdown
Contributor Author

LiaOb21 commented Apr 7, 2026

@nf-core-bot fix linting please

@LiaOb21
Copy link
Copy Markdown
Contributor Author

LiaOb21 commented Apr 7, 2026

Hello @mashehu @maxulysse and @chriswyatt1! 😊
I think this is almost ready for review.

As regards the container, here is the conda recipe: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/masurca/meta.yaml

The reason why the biocontainer obtained from conda doesn't work is because some dependencies, like for example file, are not in the recipe although they are used from masurca. The tests work with conda because these dependencies are normally available in a linux environment, but they fail with docker for obvious reasons. However, the container created by @chriswyatt1 works, and this is what I am using at the moment, and I already asked on slack to host it in the nf-core org.

Also, I was having some linting issues:

╭─ [✗] 1 Module Test Failed ───────────────────────────────────────────────────╮
│             ╷                               ╷                                │
│ Module name │ File path                     │ Test message                   │
│╶────────────┼───────────────────────────────┼───────────────────────────────╴│
│ masurca     │ modules/nf-core/masurca/main… │ main_nf_ext_key: Invalid 'ext' │
│             │                               │ keys detected:                 │
│             │                               │ ext.extend_jump_reads,         │
│             │                               │ ext.extend_jump_reads,         │
│             │                               │ ext.graph_kmer_size,           │
│             │                               │ ext.use_linking_mates,         │
│             │                               │ ext.use_linking_mates,         │
│             │                               │ ext.lhe_coverage,              │
│             │                               │ ext.mega_reads_one_pass,       │
│             │                               │ ext.mega_reads_one_pass,       │
│             │                               │ ext.limit_jump_coverage,       │
│             │                               │ ext.ca_parameters,             │
│             │                               │ ext.close_gaps,                │
│             │                               │ ext.close_gaps, ext.jf_size    │
│             ╵                               ╵                                │
╰──────────────────────────────────────────────────────────────────────────────╯

Then I saw this issue: #8716
And I moved the task.ext to inputs. The linting now works. I left the task.ext commented just to double check with you if this is the correct approach and go back quickly if necessary.

And, finally, the tests. These are avoiding the md5 check because masurca works with absolute paths, and these paths change for every run making the md5 inconsistent. Please let me know if I used the right approach or if there is anything else to change.

Thank you so much for your help! 😊

Co-authored-by: Famke Bäuerle <45968370+famosab@users.noreply.github.com>
@LiaOb21
Copy link
Copy Markdown
Contributor Author

LiaOb21 commented Apr 10, 2026

Hi @mashehu, the bioconda recipe has been update and works. I tried to obtain the docker and singularity with wave cli using both pixi:v1 and default template:

---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
channels:
  - conda-forge
  - bioconda
dependencies:
  - "bioconda::masurca=4.1.4=ha5bb246_1"

wave --conda-file environment.yml --freeze --platform linux/amd64 -o yaml --build-template conda/pixi:v1 --await
buildId: bd-2788c9bd40859936_1
cached: true
containerImage: community.wave.seqera.io/library/masurca:4.1.4--2788c9bd40859936
duration: PT2M52.608153482S
freeze: true
mirror: false
requestId: 4a0e1f34d473
scanId: sc-e633b339ce626493_1
status: DONE
succeeded: true
targetImage: community.wave.seqera.io/library/masurca:4.1.4--2788c9bd40859936
wave --conda-file environment.yml --freeze --platform linux/amd64 -o yaml --build-template conda/pixi:v1 --singularity --await
buildId: bd-2d0ee6b415d9e77e_1
cached: false
containerImage: oras://community.wave.seqera.io/library/masurca:4.1.4--2d0ee6b415d9e77e
duration: PT3M32.779779428S
freeze: true
mirror: false
requestId: 5fbb5d604d9e
status: DONE
succeeded: true
targetImage: oras://community.wave.seqera.io/library/masurca:4.1.4--2d0ee6b415d9e77e
wave --conda-file environment.yml --freeze --platform linux/amd64 -o yaml --await
buildId: bd-6c6f779135534097_1
cached: false
containerImage: community.wave.seqera.io/library/masurca:4.1.4--6c6f779135534097
duration: PT1M59.355046634S
freeze: true
mirror: false
requestId: 7bc08416d07c
scanId: sc-680889439ee593d6_1
status: DONE
succeeded: true
targetImage: community.wave.seqera.io/library/masurca:4.1.4--6c6f779135534097
wave --conda-file environment.yml --freeze --platform linux/amd64 -o yaml --singularity --await
buildId: bd-40a5894c07082589_1
cached: false
containerImage: oras://community.wave.seqera.io/library/masurca:4.1.4--40a5894c07082589
duration: PT2M39.588376664S
freeze: true
mirror: false
requestId: f7d8cd412b66
status: DONE
succeeded: true
targetImage: oras://community.wave.seqera.io/library/masurca:4.1.4--40a5894c07082589

Again, the new conda recipe works, while both the docker versions (pixi and default) give me the error:
Overlap/unitig failed, check output under CA/ and runCA1.out, try re-generating and re-running assemble.sh for each of the tests.

Is there anything else I can try? I believe I added all the required dependencies to the bioconda recipe now.

@LiaOb21 LiaOb21 marked this pull request as ready for review April 10, 2026 10:22
@LiaOb21
Copy link
Copy Markdown
Contributor Author

LiaOb21 commented Apr 10, 2026

I forgot to add that I tested the docker container from bioconda fetch artifacts with same test data and masurca config files and it works without problems.

config_lines << "END"
config_lines << ""

// PARAMETERS section
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a very long time since I've used Masura so I don't remember much about what's required input. Can these be supplied through ext.args?

The idea would be that you would have

config_lines.addAll(args.collect{ key, value -> "${key}=${value}"})

So ext.args is a map in this case (for example like modules VIBER or MOBSTER).

The user would then supply lines through the ext.args

withName: 'MASURCA' {
    ext.args = [
        LHE_COVERAGE: 35,
        SOAP_ASSEMBLY: 0,
    ]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mahesh-panchal, I moved several arguments to inputs in this commit: LiaOb21@c3c4faf
I made this change because the linter was throwing an error, if I remember well it wanted those arguments documented in meta.yml, though I'm not entirely sure. I believe the map won't trigger the same issue, so we could try that approach.
However, regarding the Masurca inputs, the only guidance available on configuring it comes from comments within the config file itself. Unfortunately, those comments don't clearly specify which parameters are optional. In the past, I attempted to omit some non-required parameters based on the input files I was working with, but this caused the generation of assemble.sh to fail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

container-issue help wanted Extra attention is needed

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

new module: MASURCA

7 participants