Skip to content

Add outer timeout for mpi testing#5150

Merged
connorjward merged 2 commits into
firedrakeproject:releasefrom
cpjordan:split-tests-outer-timeout
Jun 11, 2026
Merged

Add outer timeout for mpi testing#5150
connorjward merged 2 commits into
firedrakeproject:releasefrom
cpjordan:split-tests-outer-timeout

Conversation

@cpjordan

@cpjordan cpjordan commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes thetisproject/thetis#459 as demonstrated on thetisproject/thetis#438 (@connorjward).

This PR adds a shell-level timeout around each split job run by firedrake-run-split-tests.

The existing pytest timeout catches long-running pytest items, but it does not protect the whole mpiexec ... pytest process tree. We have observed (in Thetis - locally and in CI) MPI split jobs that print pytest success and then hang indefinitely during PETSc/OpenMPI finalization. In that state, the wrapper never writes jobN.errcode, GNU parallel never returns, and CI remains stuck.

Change

Each split job now runs as:

timeout --kill-after=${kill_after} ${outer_timeout} ${pytest_cmd}

inside the existing tee pipeline.

The timeout settings are configurable:

FIREDRAKE_RUN_SPLIT_TESTS_TIMEOUT=1800s
FIREDRAKE_RUN_SPLIT_TESTS_KILL_AFTER=60s

This PR assumes GNU timeout is available, matching the current Linux CI environment where firedrake-run-split-tests is used. If this helper is later used in macOS CI, a follow-up can either install Homebrew coreutils and use gtimeout, or add a small platform-specific wrapper.

@connorjward connorjward left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably go into release so it becomes available for Thetis sooner.

Comment thread scripts/firedrake-run-split-tests Outdated
kill_after=${FIREDRAKE_RUN_SPLIT_TESTS_KILL_AFTER:-60s}

if ! command -v timeout >/dev/null 2>&1; then
echo "GNU timeout is required" >&2

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your other PR you allow for GNU parallel not being there. This should also be optional no?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we definitely want this to run on macOS, so making it optional would help with that if you don't want to setup gtimeout as you suggested.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense but I was waiting for #5147 to be merged (I don't have the permissions to do so).

Do you want me to make the target branch for #5147 to be release as well? Then I can update this branch for your comments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah #5147 should also go into release. I plan to merge that ASAP

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the base on #5147, will update this branch once merged

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updates are in now

@cpjordan cpjordan force-pushed the split-tests-outer-timeout branch from d11a3d5 to 94e6599 Compare June 10, 2026 16:22
@cpjordan cpjordan changed the base branch from main to release June 10, 2026 16:23
@connorjward connorjward merged commit ffcad0c into firedrakeproject:release Jun 11, 2026
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parallel MPI testing hangs

2 participants