Builds: isolated builds by humitos · Pull Request #13104 · readthedocs/readthedocs.org

humitos · 2026-06-04T15:46:07Z

Description

This PR allows us to change the architecture of how we run builds in Read the Docs. The architecture document describes the implementation in depth: architecture.md. Please read that document and let me know if you have any concerns about it.

Notes

There is a ./dev-run.sh script in the repository that I used to develop the Python script that builds the docs inside a container. We are not going to use it in production. You can skip it completely since it is only meant for development.
Inside the app, we use ./entrypoint.sh, which is the script that runs directly inside the Docker container and handles process signals such as SIGTERM (SIGKILL cannot be trapped, so only the catchable signals are handled).
How we manage Docker images in AWS with ECR is not fully decided yet; I'm still running my first tests there.
~~ECS+Fargate is still under testing -- but I will probably need to deploy the code from this PR to do real tests.~~ I was able to test it by calling boto3 manually in production and running a build for the test-builds-fargate project. The infrastructure works 🎉
The HTML upload implemented in #13011 ("builds: pre-built HTML upload as a Version source type") is not included in readthedocs-builder yet.
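
For reviewers unfamiliar with the ECS API, here is a minimal sketch of how the arguments for a boto3 `run_task` call on Fargate can be assembled. It is not the code from this PR or the exact call used in the manual test; the function name `build_run_task_kwargs` and all of its parameters are hypothetical. Only the keys of the returned dict follow the boto3 ECS `run_task` request shape.

```python
from typing import Dict, List


def build_run_task_kwargs(
    cluster: str,
    task_definition: str,
    container_name: str,
    command: List[str],
    env: Dict[str, str],
    cpu: int,
    memory: int,
    subnets: List[str],
    security_groups: List[str],
    assign_public_ip: bool = False,
) -> dict:
    """Assemble keyword arguments for boto3's ECS ``run_task`` (hypothetical helper)."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        # A capacity provider strategy is used instead of ``launchType``;
        # the two are mutually exclusive in the RunTask API.
        "capacityProviderStrategy": [
            {"capacityProvider": "FARGATE_SPOT", "weight": 1},
        ],
        "overrides": {
            # Task-level CPU/memory overrides are passed as strings.
            "cpu": str(cpu),
            "memory": str(memory),
            "containerOverrides": [
                {
                    "name": container_name,
                    "command": command,
                    "environment": [
                        {"name": key, "value": value}
                        for key, value in sorted(env.items())
                    ],
                }
            ],
        },
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                "assignPublicIp": "ENABLED" if assign_public_ip else "DISABLED",
            }
        },
    }
```

With real credentials, the result would be passed as `boto3.client("ecs").run_task(**kwargs)`, and the task ARN read from `response["tasks"][0]["taskArn"]`.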

Improvements / Features

Each build/project/organization can spin up Docker containers with specific vCPU/RAM.
Security: all the commands run inside a Docker container without access to anything from our infra/code.
We can eventually give users root capabilities if we want.
No need to have build-* queues or to deal with queue backlogs.
Greatly reduces Celery headaches.
Easy to test different combinations of build instance types.
There is no limit on the number of concurrent builds running at a time across the whole platform.
If we want to, we can limit the number of running builds by queuing submit_build_to_ecs tasks instead of handling them immediately. I don't think this is needed, though.
We can deploy new code to readthedocs-builder without creating an AMI and performing a full deploy, and new Docker containers will pick it up immediately.

Rollout

Behind a feature flag. We can add a subset of projects to start testing this.
We need this PR to be merged and deployed using the current process. There are not many changes in this PR; most of the code is isolated in the readthedocs/projects/tasks/fargate.py file. The rest is new model fields, new settings, and the choice of code path depending on whether the feature flag is enabled.
We can deploy this PR and perform live changes on projects with the feature flag by updating the rel branch on readthedocs-builder for quick tests.

Decisions / ToDo

The test suite we currently have has not been migrated to the new repository. We will need to think about and decide what to do there. Since the builder can now be run locally, we would be able to test more deeply if we want to.
We need to work on the TF code for the infrastructure.
What are the next steps here?

How to review this PR

Consider the big picture as a whole.
Look through the code for anything that could break production. Those issues are important to fix at this point.
Review the architecture.md file and try to load all that information in your mind 😄. We are changing how all the builds are processed.
Look for inconsistencies in that document, potential breaks in data flows, and similar problems.
Don't look at the code from readthedocs-builder at all. I haven't reviewed it either and I'm not worried about it at this point. I will get back to that later, once we have it tested.

Test this locally

Pull down this PR
Clone readthedocs-builder and build the Docker image

docker build -t builder-dev:latest .

Define the RTD_PATH_BUILDER environment variable as the path where you cloned readthedocs-builder (e.g. /home/username/work/readthedocs-builder; it has to be the full path)
Start local development and add the use_fargate_builder feature to a project.
Trigger a build for that project.
Check the logs for the app as usual.
Check the logs for the Docker container with docker logs -f build-<pk>

Original discussion: https://github.com/readthedocs/meta/discussions/135
Issue: https://github.com/readthedocs/meta/issues/210

…BUILDER

First piece of the Fargate builder migration. Adds the data-model surface the new code path will need; nothing reads or writes these fields yet.

- Build.task_arn (CharField, max_length=255): set by the new submit_build_to_ecs Celery task; consumed by cancel_build to call ecs:StopTask. Coexists with the legacy task_id field for the duration of the rollout.
- Project.container_cpu_limit (PositiveIntegerField): Fargate CPU units (1024 = 1 vCPU). Paired with the existing container_mem_limit / container_time_limit; defaults to the system-wide default when unset.
- Feature.USE_FARGATE_BUILDER: per-project feature flag the trigger and cancel dispatchers will check to decide between the legacy update_docs_task path and the new Fargate path.

Migrations are additive (Safe.before_deploy) so the legacy code keeps working through the rollout. See readthedocs-builder/docs/architecture.md for the broader design.

Second piece of the Fargate builder migration. Lands the Celery task that the trigger dispatcher will route to in Phase 5, plus the settings it relies on. Nothing calls submit_build_to_ecs yet; that wiring happens in the next piece.

readthedocs/projects/tasks/fargate.py:

- submit_build_to_ecs(build_pk): sparse-clones the YAML, resolves build.os, resolves per-build CPU/memory/time-limit, mints a per-build API key, calls ecs:RunTask, and stores the returned task ARN on Build.task_arn.
- _sparse_clone_yaml: ``--filter=blob:none --no-checkout`` then ``sparse-checkout`` for the four candidate filenames (with/without dot, .yaml/.yml). HTTPS clone with token injection; SSH errors out (deferred).
- _read_build_os: parse the YAML, resolve the ``ubuntu-lts-latest`` alias.
- _resolve_fargate_resources: layer project field -> settings default -> RTD_BUILD_MAX_* caps -> snap to a Fargate-supported CPU/memory pair.
- _snap_to_fargate_pair: hardcoded AWS Fargate CPU/memory matrix (256/512/1024/2048/4096/8192/16384 CPU-unit tiers, where 1024 = 1 vCPU).
- _ecs_run_task: boto3 RunTask with FARGATE_SPOT capacity provider, the CPU + memory + command + env overrides, and the awsvpc network config. AWS errors wrapped in BuildAppError for consistent handling.
- Feature.USE_FARGATE_BUILDER is enforced defensively at the top of the task in case the dispatcher routes here without the flag.

readthedocs/settings/base.py:

- RTD_BUILD_DEFAULT_CPU / MEMORY / TIME_LIMIT (2048 / 8192 / 1800).
- RTD_BUILD_MAX_CPU / MEMORY / TIME_LIMIT system-wide caps.
- RTD_BUILD_TIME_LIMIT_GRACE_SECONDS / KILL_SECONDS for the bash watchdog.
- RTD_BUILDER_REPO / REF (default ``rel``).
- RTD_ECS_CLUSTER / TASK_DEFINITION_FORMAT / SUBNETS / SECURITY_GROUPS / ASSIGN_PUBLIC_IP / REGION (production-only; empty placeholders here).

Smoke tests:

- ``python -m py_compile`` on the task and migrations.
- _parse_mem_limit_mb on int, ``8g``, ``512m``, plain digits, None, empty, garbage.
- _snap_to_fargate_pair: exact pairs, snap-up within tier, lower bound, CPU cap, memory cap within tier, both-cap upper bound.

See readthedocs-builder/docs/architecture.md for the broader design.
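
The layering and snapping described for _resolve_fargate_resources and _snap_to_fargate_pair can be sketched as follows. This is an illustration under assumptions, not the code in fargate.py: the function names, signatures, and the `max_cpu` / `max_mem` values are hypothetical (the commit does not state the caps). The defaults 2048 / 8192 come from the commit message, and the matrix reflects the CPU/memory combinations AWS documents for Fargate (CPU units, memory in MiB).

```python
from typing import Optional, Tuple

# Valid memory values (MiB) for each Fargate CPU tier (CPU units; 1024 = 1 vCPU).
FARGATE_MATRIX = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def snap_to_fargate_pair(cpu: int, memory: int) -> Tuple[int, int]:
    """Snap a requested (cpu, memory) up to a pair Fargate accepts."""
    tiers = sorted(FARGATE_MATRIX)
    # Smallest CPU tier that covers the request, or the largest tier.
    tier = next((t for t in tiers if t >= cpu), tiers[-1])
    options = FARGATE_MATRIX[tier]
    # Smallest memory value in that tier that covers the request,
    # capped at the tier's maximum.
    mem = next((m for m in options if m >= memory), options[-1])
    return tier, mem


def resolve_resources(
    project_cpu: Optional[int],
    project_mem: Optional[int],
    default_cpu: int = 2048,
    default_mem: int = 8192,
    max_cpu: int = 4096,   # assumed cap, not from the PR
    max_mem: int = 16384,  # assumed cap, not from the PR
) -> Tuple[int, int]:
    """Project value -> default -> system-wide cap -> Fargate pair."""
    cpu = min(project_cpu or default_cpu, max_cpu)
    mem = min(project_mem or default_mem, max_mem)
    return snap_to_fargate_pair(cpu, mem)
```

Note that snapping rounds up, so in this sketch a cap that is not itself a Fargate tier could be exceeded after snapping; choosing caps that match tier values avoids that.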

…uild

Third (and final) piece of the Fargate builder migration. Phase 4 is now complete: enabling Feature.USE_FARGATE_BUILDER on a project routes its builds through ``submit_build_to_ecs`` -> Fargate; everything else continues to use ``update_docs_task`` as before.

readthedocs/core/utils/__init__.py:

- trigger_build: after prepare_build, branch on project.has_feature(Feature.USE_FARGATE_BUILDER). Fargate path enqueues submit_build_to_ecs.delay(build_pk=build.pk) instead of apply_async on the update_docs_task signature. The signature still gets built but is unused on the Fargate path, which is accepted as minor wasted work for the rollout window.
- cancel_build: branch on which task identifier is set on the Build:
  1. task_arn (Fargate dispatched) -> ecs:StopTask with reason. Failures are logged but don't propagate (e.g. the task may have already exited by the time we cancel).
  2. task_id (legacy Celery) -> app.control.revoke as before.
  3. Neither -> log a notice. submit_build_to_ecs guards against this case by checking BUILD_STATE_CANCELLED at the top.
  The branch is on the build's *actual* state, not the project's current flag value, so an in-flight build that started before the flag was flipped still cancels correctly.

readthedocs/projects/tasks/fargate.py:

- submit_build_to_ecs: at the top, bail out if Build.state == BUILD_STATE_CANCELLED. Covers the race where cancel_build flips the state while the Celery task is still queued; we skip API-key minting + ecs:RunTask + arn write entirely.

Smoke tests:

- ``python -m py_compile`` on both files.
- grep-audit confirms task_arn / USE_FARGATE_BUILDER / stop_task / revoke wiring is in place.

See readthedocs-builder/docs/architecture.md (Phase 4) for the broader design.
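
The cancel_build branching described in this commit can be sketched roughly as below. This is a simplified stand-in, not the code in readthedocs/core/utils/__init__.py: `BuildStub` and the injected `stop_ecs_task` / `revoke_celery_task` callables are hypothetical and replace the real Build model, the ecs:StopTask call, and app.control.revoke.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

log = logging.getLogger(__name__)


@dataclass
class BuildStub:
    """Hypothetical stand-in for the Build model's relevant fields."""

    pk: int
    task_arn: Optional[str] = None
    task_id: Optional[str] = None


def cancel_build(
    build: BuildStub,
    stop_ecs_task: Callable[[str], None],
    revoke_celery_task: Callable[[str], None],
) -> str:
    """Cancel based on which identifier the build actually has."""
    if build.task_arn:
        try:
            stop_ecs_task(build.task_arn)
        except Exception:
            # The task may already have exited; log and move on.
            log.exception("Could not stop task for build %s", build.pk)
        return "fargate"
    if build.task_id:
        revoke_celery_task(build.task_id)
        return "celery"
    # Nothing dispatched yet; the submit task is expected to check
    # the cancelled state itself before doing any work.
    log.info("Build %s has no task identifier to cancel", build.pk)
    return "none"
```

Branching on the stored identifier rather than on the project's feature flag is what keeps in-flight builds cancellable after the flag is flipped.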

When RTD_DOCKER_COMPOSE is set (the local dev environment), submit_build_to_ecs spawns a sibling builder container on the host's docker daemon via docker-py instead of calling ecs:RunTask. Same interface, same env vars; just a different backend behind _dispatch_build_task. Lets us test the whole new build path end-to-end against devthedocs.org + rustfs without standing up a real ECS cluster.

readthedocs/settings/docker_compose.py:

- RTD_LOCAL_BUILDER_IMAGE (default: builder-dev:latest). Built once by the dev via ``cd ../readthedocs-builder && docker build -t builder-dev:latest .``
- RTD_LOCAL_BUILDER_HOST_PATH (from env var). When set, bind-mounted at /opt/builder so the entrypoint skips the GitHub clone; matches the dev-run.sh iteration loop. Leave unset to exercise the clone path.

readthedocs/projects/tasks/fargate.py:

- _dispatch_build_task: branches on settings.RTD_DOCKER_COMPOSE. Routes to _docker_run_task (dev) or _ecs_run_task (prod).
- _docker_run_task: docker-py client.containers.run with cpu/memory constraints (Fargate units -> nano_cpus/mem_limit), the compose network (RTD_DOCKER_COMPOSE_NETWORK), the optional host bind-mount, detach + auto_remove. Returns ``docker://<container-id>`` pseudo-ARN.
- submit_build_to_ecs: when in docker-compose mode, forward AWS_* + S3 bucket env vars (matching what dev-run.sh forwards), set RTD_DOCKER_USER=root, and append --storage-from-env to the runner command so the runner uses env-var-based AWS creds instead of the API's STS endpoint.

readthedocs/core/utils/__init__.py (cancel_build):

- If task_arn starts with ``docker://``, look up the container via docker.from_env().containers.get(...).kill(). Failures logged, not propagated (matches the ecs:StopTask behaviour).
- Otherwise fall through to the existing ecs:StopTask path.

Setup notes (not in this commit; dev's job):

- Mount /var/run/docker.sock into the celery container.
- Build the builder-dev image once.
- Export RTD_LOCAL_BUILDER_HOST_PATH=/absolute/path/to/readthedocs-builder before bringing up docker-compose if you want the bind-mount.

See readthedocs-builder/docs/architecture.md.
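
The "Fargate units -> nano_cpus/mem_limit" conversion mentioned for _docker_run_task amounts to simple arithmetic. The helper below is a hypothetical sketch (its name and return shape are not from the PR); it relies only on the facts that 1024 Fargate CPU units equal 1 vCPU, that docker-py's `nano_cpus` is expressed in units of 1e-9 CPUs, and that `mem_limit` accepts strings such as `512m`.

```python
def fargate_to_docker_limits(cpu_units: int, memory_mb: int) -> dict:
    """Translate Fargate CPU units / MiB into docker-py run() kwargs."""
    return {
        # 1024 CPU units == 1 vCPU == 1_000_000_000 nano CPUs.
        "nano_cpus": cpu_units * 1_000_000_000 // 1024,
        # docker-py accepts a string with a unit suffix.
        "mem_limit": f"{memory_mb}m",
    }
```

The resulting dict could be splatted into `client.containers.run(image, command, detach=True, **limits)`.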

Three follow-ups to commit c5c00c9:

1. KeyError for the fargate task on the celery worker. Tasks in ``readthedocs/projects/tasks/`` are loaded explicitly from ``ProjectsConfig.ready()`` (PEP 420 namespace package, no __init__.py, so Celery's autodiscover doesn't pick up submodules on its own). Added the import for fargate.py alongside builds / search / utils.
2. Drop the ``docker://`` prefix on Build.task_arn. The runtime distinction is already captured by ``settings.RTD_DOCKER_COMPOSE``; sticking with raw container ids / ECS ARNs keeps the field clean. cancel_build now branches on the setting instead of parsing a prefix.
3. The runner uses the API's STS endpoint for storage credentials in both production and local dev (mirrors production semantics). Drop --storage-from-env from the dispatched command and stop forwarding AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / RTD_S3_*_BUCKET. We still forward AWS_S3_ENDPOINT_URL because boto3 needs to know where to point in dev (rustfs at http://storage:9000); credentials and bucket names come from /api/v2/build/<id>/credentials/storage.

Rename: RTD_LOCAL_BUILDER_HOST_PATH -> RTDDEV_PATH_BUILDER for consistency with the existing RTDDEV_* naming convention. RTD_LOCAL_BUILDER_IMAGE keeps a TODO marking it for removal once the readthedocs/builder:<os> image matrix exists and we can pick the image from build.os exactly like production does.

Matches the convention the rest of the codebase uses for web-celery tasks (audit / telemetry / subscriptions / search / etc. all use queue="web"). Otherwise the task defaults to the celery queue and gets picked up by the build worker, which doesn't know about it.

humitos added 15 commits June 3, 2026 11:35

Updates to make it work on local development

f7c3a20

NGINX access and error logs

93c7b8e

Handle exceptions _before_ running inside the container

5de5bb7

- no YAML file
- invalid clone URL
- etc

Handle canceling from UI

02a1518

Mount the builder repository as read-only

32b73d1

Allow iterations on the entrypoint without rebuilding Docker image

23ae324

Update common/

4b8d08f

Log the exception when submit_build_to_ecs fails

469492c

Update settings to make it easier to run locally

65fdfb0

humitos requested a review from a team as a code owner June 4, 2026 15:46

humitos requested a review from stsewd June 4, 2026 15:46

humitos wants to merge 15 commits into main from humitos/isolated-builds
