
Abort execution when platform telemetry error #6827

Merged
bentsherman merged 8 commits into master from nf-356-tower-abort-on-error
Apr 22, 2026


@jorgee
Contributor

@jorgee jorgee commented Feb 12, 2026

This pull request introduces a new mechanism to control error handling behavior in the TowerClient class by adding an abortOnError flag, which can be set via the environment variable TOWER_ABORT_ON_ERROR. When enabled, critical errors encountered while communicating with Seqera Platform will cause the workflow to abort immediately using the AbortRunException. The changes also include improved error propagation and additional tests to verify this behavior.

Error Handling Improvements:

  • Added abortOnError flag to TowerClient, defaulting to true, and made it configurable via the TOWER_ABORT_ON_ERROR environment variable. This determines whether critical errors abort the workflow or are handled as warnings. [1] [2]
  • Updated error handling in TowerClient methods (logHttpResponse, parseTowerResponse, and others) to throw AbortRunException when abortOnError is enabled, ensuring immediate workflow termination on critical errors. [1] [2] [3] [4] [5]
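
The flag described in the bullets above can be sketched roughly as follows. This is a minimal illustration in Java rather than the plugin's Groovy, and it is not the actual TowerClient code: `TelemetryClient`, `parseAbortOnError`, `handleResponse`, and the `/example-endpoint` path are hypothetical names, `AbortRunException` is a local stand-in for the Nextflow class, and the way the variable's value is parsed is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Local stand-in for Nextflow's AbortRunException (the real class ships with Nextflow). */
class AbortRunException extends RuntimeException {
    AbortRunException(String message) { super(message); }
}

/** Hypothetical, simplified client; not the actual TowerClient implementation. */
class TelemetryClient {
    private final boolean abortOnError;
    /** Collected instead of logged so the example has no logging dependency. */
    final List<String> warnings = new ArrayList<>();

    TelemetryClient(Map<String, String> env) {
        this.abortOnError = parseAbortOnError(env);
    }

    /**
     * Unset means true, matching the default stated in the PR description.
     * Assumption: any value other than "true" (case-insensitive) disables the flag.
     */
    static boolean parseAbortOnError(Map<String, String> env) {
        String value = env.get("TOWER_ABORT_ON_ERROR");
        return value == null || Boolean.parseBoolean(value.trim());
    }

    /** Treats any non-2xx status as a critical error. */
    void handleResponse(int status, String endpoint) {
        if (status >= 200 && status < 300) {
            return;
        }
        String message = "Unexpected response from Seqera Platform: HTTP " + status + " for " + endpoint;
        if (abortOnError) {
            throw new AbortRunException(message); // hard failure: abort the run
        }
        warnings.add(message);                    // soft failure: warn and continue
    }
}

class AbortOnErrorDemo {
    public static void main(String[] args) {
        TelemetryClient lenient = new TelemetryClient(Map.of("TOWER_ABORT_ON_ERROR", "false"));
        lenient.handleResponse(401, "/example-endpoint");
        System.out.println("warnings recorded: " + lenient.warnings.size());

        TelemetryClient strict = new TelemetryClient(Map.of()); // variable unset
        try {
            strict.handleResponse(401, "/example-endpoint");
        } catch (AbortRunException e) {
            System.out.println("run aborted: " + e.getMessage());
        }
    }
}
```

Passing the environment in as a map, instead of calling `System.getenv()` inside the class, is only a choice made here so that both settings can be exercised in one process.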

Session and Exception Propagation:

  • Modified the Session class to specifically catch and log AbortRunException during observer notification, ensuring these exceptions propagate and abort the workflow as intended. [1] [2]
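
As a rough sketch of the propagation rule above: an observer loop can log and rethrow `AbortRunException` while downgrading every other observer failure to a warning. The code is Java rather than Groovy, and `FlowObserver`, `ObserverNotifier`, and `notifyFlowBegin` are hypothetical names, not the Session API. Propagation is modelled here as a plain rethrow, and the real Session may abort the run differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Local stand-in for Nextflow's AbortRunException (the real class ships with Nextflow). */
class AbortRunException extends RuntimeException {
    AbortRunException(String message) { super(message); }
}

/** Hypothetical one-method observer, far smaller than a real trace observer. */
interface FlowObserver {
    void onFlowBegin();
}

/** Hypothetical notifier standing in for the observer loop in Session. */
class ObserverNotifier {
    /** Collected instead of logged so the example has no logging dependency. */
    final List<String> log = new ArrayList<>();

    void notifyFlowBegin(List<FlowObserver> observers) {
        for (FlowObserver observer : observers) {
            try {
                observer.onFlowBegin();
            } catch (AbortRunException e) {
                // The observer asked for the run to stop: record it and propagate.
                log.add("ERROR " + e.getMessage());
                throw e;
            } catch (RuntimeException e) {
                // Any other observer failure is only a warning; keep notifying the rest.
                log.add("WARN " + e.getMessage());
            }
        }
    }
}

class ObserverNotifierDemo {
    public static void main(String[] args) {
        FlowObserver flaky = () -> { throw new IllegalStateException("report observer failed"); };
        FlowObserver platform = () -> { throw new AbortRunException("Platform rejected the run"); };
        FlowObserver last = () -> System.out.println("not reached after an abort");

        ObserverNotifier notifier = new ObserverNotifier();
        try {
            notifier.notifyFlowBegin(List.of(flaky, platform, last));
        } catch (AbortRunException e) {
            System.out.println("run aborted: " + e.getMessage());
        }
        System.out.println(notifier.log);
    }
}
```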

Tests:

  • Added new tests in TowerClientTest to verify the correct detection of the abortOnError setting and to ensure that the workflow aborts as expected when errors occur and abortOnError is enabled. [1] [2]

…lemetry errors

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@netlify

netlify Bot commented Feb 12, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 1d93537
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69e898a387e3aa0008ebb672

Comment thread plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerClient.groovy Outdated
Comment thread plugins/nf-tower/src/main/io/seqera/tower/plugin/TowerClient.groovy Outdated
Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
Signed-off-by: Jorge Ejarque <jorgee@users.noreply.github.com>
@jorgee
Contributor Author

jorgee commented Apr 14, 2026

I have updated this branch to master.
If I am not wrong, in one of our conversations we agreed to fail only for Platform errors rather than for all exceptions. This PR implements that behaviour and still allows the previous behaviour through an environment variable. I will close #6823 in favor of this one.

@bentsherman bentsherman linked an issue Apr 15, 2026 that may be closed by this pull request
@bentsherman bentsherman requested a review from pditommaso April 15, 2026 15:40
@bentsherman
Member

@pditommaso let me know if the latest changes look good to you

I like the general principle of "plugin observer errors are logged as warnings by default, observer can throw AbortRunException to fail the run"

This PR currently just adds an env var to control whether the TowerClient throws hard/soft errors. But I wonder if we should just decide for each error case whether it should be hard or soft instead of introducing an environment var

For example, if I run with -with-tower but fail to authenticate, maybe that should just always be a hard error

In any case, let's wait until #6946 is merged since it refactors the tower client (harder to resolve merge conflicts there)

@bentsherman
Member

From today's discussion -- @jorgee please update this PR making all failures related to sending data to Platform hard failures, removing the need for the environment variable.

jorgee and others added 2 commits April 21, 2026 11:26
@jorgee
Contributor Author

jorgee commented Apr 21, 2026

I have removed the environment variable and throw AbortRunException for all calls except heartbeats and onFlowComplete.

  • For onFlowComplete, it is the last message and execution will be finished anyway.
  • I have some doubts about the heartbeat case. Failing at the first exception could be too strict. I have therefore assumed that a task complete message will be sent at some point, and that it will throw the AbortRunException if there is a persistent connection problem. Is that fine, or should I throw an AbortRunException at the first error?
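
The hard/soft split described in this comment could be expressed along these lines. This is a hedged sketch in Java, not the merged Groovy code: the `PlatformCall` labels, `FailurePolicy`, and `onFailure` are hypothetical names chosen for illustration, and `AbortRunException` is a local stand-in for the Nextflow class.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Local stand-in for Nextflow's AbortRunException (the real class ships with Nextflow). */
class AbortRunException extends RuntimeException {
    AbortRunException(String message) { super(message); }
}

/** Hypothetical labels for the kinds of request sent to Platform. */
enum PlatformCall { FLOW_CREATE, FLOW_BEGIN, TASK_UPDATE, HEARTBEAT, FLOW_COMPLETE }

/** Hypothetical policy object: every call is a hard failure except the two soft ones. */
class FailurePolicy {
    private static final Set<PlatformCall> SOFT =
            EnumSet.of(PlatformCall.HEARTBEAT, PlatformCall.FLOW_COMPLETE);

    /** Collected instead of logged so the example has no logging dependency. */
    final List<String> warnings = new ArrayList<>();

    static boolean abortsRun(PlatformCall call) {
        return !SOFT.contains(call);
    }

    void onFailure(PlatformCall call, String reason) {
        String message = call + " failed: " + reason;
        if (abortsRun(call)) {
            throw new AbortRunException(message); // hard failure
        }
        warnings.add(message);                    // heartbeat or completion: warn only
    }
}

class FailurePolicyDemo {
    public static void main(String[] args) {
        FailurePolicy policy = new FailurePolicy();
        // With a persistent connection problem, the heartbeat only warns...
        policy.onFailure(PlatformCall.HEARTBEAT, "connection refused");
        System.out.println("warnings so far: " + policy.warnings.size());
        // ...but the next task update hits the same problem and aborts the run.
        try {
            policy.onFailure(PlatformCall.TASK_UPDATE, "connection refused");
        } catch (AbortRunException e) {
            System.out.println("run aborted: " + e.getMessage());
        }
    }
}
```

The demo mirrors the reasoning in the second bullet: a single failed heartbeat is tolerated, while a lasting outage still stops the run once a hard call fails.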

@jorgee jorgee linked an issue Apr 21, 2026 that may be closed by this pull request
@bentsherman
Member

For the heartbeats, I think keeping them as warnings is reasonable

For the onFlowComplete, if it fails, wouldn't that leave the platform run in an indeterminate state? Or does platform still mark the run as completed?

@jorgee
Contributor Author

jorgee commented Apr 21, 2026

For the onFlowComplete, if it fails, wouldn't that leave the platform run in an indeterminate state? Or does platform still mark the run as completed?

I'll try to check what it does in this case. In any case, it will not receive the trace event. The only possibility is that it monitors the head job and uses the process exit code to decide whether the run failed. In that case, I think it is better to warn instead of throwing the AbortRunException.

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee force-pushed the nf-356-tower-abort-on-error branch from da4827b to 1d06df8 Compare April 22, 2026 08:39
jorgee added 2 commits April 22, 2026 10:52
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee
Contributor Author

jorgee commented Apr 22, 2026

Confirmed! I ran an execution that skipped the trace completed event. The workflow state in Platform is not left undetermined: when the head job finishes, the workflow execution is considered completed, but it is marked as FAILED.

Last updates:

  • I have updated the branch with master
  • Fixed an issue introduced in the TowerClient refactor where the begin trace was failing but the tests did not detect it, because the exception was not aborting the execution (commit 1d93537)

@bentsherman bentsherman merged commit b1ad3f7 into master Apr 22, 2026
25 checks passed
@bentsherman bentsherman deleted the nf-356-tower-abort-on-error branch April 22, 2026 13:27

Development

Successfully merging this pull request may close these issues.

Invalid Platform endpoint with -with-tower is silently ignored
Pipeline completed with errors ends with correct exit-code

3 participants