Fix: Prevent Orphaned Submissions When SQS Publish Fails #5009

Open
WHOIM1205 wants to merge 6 commits into Cloud-CV:master from WHOIM1205:fix/orphaned-submission-on-sqs-failure

Conversation

@WHOIM1205


Description

This PR fixes a critical consistency bug in EvalAI’s submission pipeline where a submission could be successfully saved to the database but never queued for evaluation if publishing the SQS message failed.

Previously, the database write and the SQS publish were not performed atomically. Any failure during the SQS publish step (network issues, AWS credential expiry, throttling, or outages) left an orphaned submission that stayed permanently stuck in the submitted state and silently consumed the participant’s submission quota.

This change ensures that a submission is either:

  • Saved and successfully queued, or
  • Cleanly rolled back if queuing fails

No orphaned submissions are left behind when the SQS publish call raises an error.


What Was Fixed

  • Wrapped the SQS publish step in proper exception handling
  • Implemented a compensating transaction pattern
  • If SQS publish fails:
    • The newly created submission is deleted
    • Participant quota is immediately freed
    • A clear error response is returned to the user
    • The failure is logged using logger.exception()
  • The success path remains unchanged
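The failure path listed above can be sketched as a compensating transaction. This is a minimal, self-contained illustration and not the code from this PR: `InMemoryStore`, `PublishError`, `submit`, and the 503 status are hypothetical stand-ins for the submission model, the SQS error, the view logic, and whatever error response the view actually returns.

```python
import logging

logger = logging.getLogger(__name__)


class PublishError(Exception):
    """Hypothetical stand-in for any error raised while publishing to SQS."""


class InMemoryStore:
    """Hypothetical stand-in for the submissions table."""

    def __init__(self):
        self.rows = {}
        self._next_pk = 1

    def create(self, payload):
        pk = self._next_pk
        self._next_pk += 1
        self.rows[pk] = payload
        return pk

    def delete(self, pk):
        self.rows.pop(pk, None)


def submit(store, publish, payload):
    """Save a submission, then publish it; undo the save if publishing fails."""
    pk = store.create(payload)
    try:
        publish({"submission_pk": pk})
    except Exception:
        # Compensating action: remove the row so it cannot be orphaned.
        logger.exception("Publish failed for submission %s; rolling back", pk)
        store.delete(pk)
        # 503 is an assumed status code, chosen only for illustration.
        return {"status": 503, "error": "Submission could not be queued."}
    return {"status": 201, "submission_pk": pk}


def failing_publish(message):
    raise PublishError("queue unreachable")


store = InMemoryStore()
failed = submit(store, failing_publish, {"file": "a.zip"})
succeeded = submit(store, lambda message: None, {"file": "b.zip"})
```

After these two calls only the second submission remains in the store. The first one was created, failed to publish, and was removed by the compensating step.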

Why This Is Important

This bug primarily affected high-traffic moments like competition deadlines, when SQS failures are most likely.

Without this fix:

  • Participants could permanently lose submission attempts
  • Submissions could appear in dashboards but never be evaluated
  • Competition fairness could be compromised
  • Manual database cleanup was required to recover lost quota

With this fix:

  • Every accepted submission is guaranteed to be queued
  • No submission quota is silently consumed
  • Platform integrity and fairness are preserved

Code Changes

Modified File

  • apps/jobs/views.py

Updated Function

  • challenge_submission (POST handler)

Summary of Change

  • Added failure-path handling around publish_submission_message
  • Deleted orphaned submissions when publish fails
  • Improved logging for operational visibility
  • Preserved existing API behavior and response contracts

No model changes.
No migrations required.
No changes to SQS message format.


Test Coverage

New unit tests were added to verify both failure and success paths.

Added Tests

  • test_challenge_submission_cleans_up_on_publish_failure
  • test_challenge_submission_handles_sqs_endpoint_failure
  • test_challenge_submission_preserves_quota_on_publish_failure
  • test_challenge_submission_returns_201_when_publish_succeeds

What These Tests Validate

  • Submissions are deleted when SQS publish fails
  • Participant quota is not consumed on failure
  • Realistic SQS errors (e.g. EndpointConnectionError) are handled
  • Existing happy-path behavior remains unchanged
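The shape of such tests can be sketched with `unittest.mock`, where a mock's `side_effect` simulates the publish failure. This is an illustrative sketch, not the tests added in this PR: `create_and_publish` is a hypothetical function under test, the built-in `ConnectionError` stands in for a real SQS error such as `EndpointConnectionError`, and the 503 status is assumed.

```python
import unittest
from unittest import mock


def create_and_publish(rows, publish, payload):
    """Hypothetical function under test: add a row, drop it if publish raises."""
    rows.append(payload)
    try:
        publish(payload)
    except Exception:
        rows.remove(payload)
        return 503  # assumed error status
    return 201


class PublishFailureTests(unittest.TestCase):
    def test_cleans_up_on_publish_failure(self):
        rows = []
        # side_effect makes the mock raise when it is called.
        publish = mock.Mock(side_effect=ConnectionError("endpoint unreachable"))
        status = create_and_publish(rows, publish, {"id": 1})
        self.assertEqual(status, 503)
        self.assertEqual(rows, [])
        publish.assert_called_once_with({"id": 1})

    def test_returns_201_when_publish_succeeds(self):
        rows = []
        publish = mock.Mock(return_value=None)
        status = create_and_publish(rows, publish, {"id": 1})
        self.assertEqual(status, 201)
        self.assertEqual(rows, [{"id": 1}])


# Run the suite programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PublishFailureTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Mocking the publish callable keeps the tests independent of AWS. The failure case asserts both the returned status and that no row is left behind.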

All tests pass successfully.


Impact After Fix

  • Zero orphaned submissions
  • Accurate submission quota enforcement
  • Reliable evaluation pipeline under SQS failures
  • Better observability via structured error logs
  • Improved fairness during competition deadlines

Notes for Reviewers

  • The fix is intentionally minimal and localized
  • No breaking changes to API behavior
  • No background jobs or cleanup tasks introduced
  • Designed to be safe under all known SQS failure modes

This change eliminates a production-critical failure mode in EvalAI’s core submission flow.

@WHOIM1205

Author

Hey @RishabhJain2018

I fixed an issue in the submission flow where a submission could get saved but never reach the evaluation queue if the SQS publish failed. That was leaving submissions stuck forever and eating into user quota.

The fix makes sure we either queue the submission successfully or clean it up properly on failure. I also added tests to cover the failure cases and make sure the existing behavior stays the same.

Would love your thoughts or any feedback on the approach.

@codecov

codecov Bot commented Feb 10, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.16%. Comparing base (74d237e) to head (10cda95).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5009      +/-   ##
==========================================
+ Coverage   92.15%   92.16%   +0.01%     
==========================================
  Files          87       87              
  Lines        7376     7386      +10     
==========================================
+ Hits         6797     6807      +10     
  Misses        579      579              
Flag        Coverage Δ
backend     96.76% <ø> (+<0.01%) ⬆️
frontend    87.45% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components                   Coverage Δ
Accounts & Authentication    98.60% <ø> (ø)
Challenges Management        95.46% <ø> (+0.01%) ⬆️
Job Processing               98.84% <ø> (ø)
Participants & Teams         100.00% <ø> (ø)
Challenge Hosts              100.00% <ø> (ø)
Analytics                    100.00% <ø> (ø)
Web Interface                100.00% <ø> (ø)
Frontend (Gulp)              87.45% <ø> (+<0.01%) ⬆️
All Models                   97.64% <ø> (+<0.01%) ⬆️
All Views                    100.00% <ø> (ø)
All Serializers              98.49% <ø> (ø)
Utility Functions            96.88% <ø> (ø)
Core Configuration           82.35% <ø> (ø)

Files with missing lines     Coverage Δ
apps/jobs/views.py           100.00% <ø> (ø)

... and 6 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 74d237e...10cda95. Read the comment docs.


Comment thread apps/jobs/views.py
submission.pk,
challenge_id,
)
submission.delete()

Member


Hey @WHOIM1205, deleting the submission isn't a good idea. Maybe set it to cancelled instead?

Author


Good point, agreed.
I’ve updated the logic to cancel the submission instead of deleting it and pushed the changes. Thanks for calling this out!
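The cancel-instead-of-delete approach discussed in this thread could look roughly like the sketch below. It is not the code pushed to this branch: `Submission`, `publish_or_cancel`, and `remaining_quota` are hypothetical names, and the rule that cancelled submissions do not count against quota is an assumption made for the example.

```python
from dataclasses import dataclass

SUBMITTED = "submitted"
CANCELLED = "cancelled"


@dataclass
class Submission:
    """Hypothetical stand-in for the submission model."""

    pk: int
    status: str = SUBMITTED


def remaining_quota(submissions, max_submissions):
    """Assumed quota rule: cancelled submissions do not count against the limit."""
    used = sum(1 for s in submissions if s.status != CANCELLED)
    return max_submissions - used


def publish_or_cancel(submission, publish):
    """Try to queue the submission; mark it cancelled (not deleted) on failure."""
    try:
        publish(submission.pk)
    except Exception:
        # Keep the row for auditing, but take it out of the active state.
        submission.status = CANCELLED
        return False
    return True


def broken_publish(pk):
    raise TimeoutError("publish timed out")


subs = [Submission(pk=1), Submission(pk=2)]
queued_first = publish_or_cancel(subs[0], lambda pk: None)
queued_second = publish_or_cancel(subs[1], broken_publish)
quota_left = remaining_quota(subs, max_submissions=5)
```

Keeping the row preserves a record of the failed attempt, while the cancelled status lets the quota calculation ignore it under the assumption stated above.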

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205 WHOIM1205 force-pushed the fix/orphaned-submission-on-sqs-failure branch from a4efa13 to 992dd40 on February 10, 2026 23:07
@WHOIM1205

Author

@RishabhJain2018 Is there anything I can fix in this PR?

@RishabhJain2018

Member

Can you please check why the Travis build is failing?

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205

Author

Quoting @RishabhJain2018: "Can you please check why the Travis build is failing?"

I am not able to fix this.

@RishabhJain2018

RishabhJain2018 commented Feb 14, 2026

Member

What is the issue?

@WHOIM1205

Author

Quoting @RishabhJain2018: "What is the issue?"

[Image attachment; not preserved]

@RishabhJain2018

Member

Hey @WHOIM1205, please try restarting the build.

2 participants