Skip to content

Builds: isolated builds#13104

Open
humitos wants to merge 15 commits into
mainfrom
humitos/isolated-builds
Open

Builds: isolated builds#13104
humitos wants to merge 15 commits into
mainfrom
humitos/isolated-builds

Conversation

@humitos
Copy link
Copy Markdown
Member

@humitos humitos commented Jun 4, 2026

Description

This PR allows us to change the architecture of how we run builds in Read the Docs. The architecture document describes in depth how is the implementation for this: architecture.md. Please, read that document and let me know if you have any concern about it.

Notes

  • There is a ./dev-run.sh script in the repository that I used to develop the Python script that builds the docs inside a container. We are not going to use it in production. You can skip it completely since it just development focused.
  • Inside the app, we use ./entrypoint.sh which is the script that runs directly inside the Docker container and handles process signals (SIGTERM, SIGKILL, etc)
  • Managing of Docker images in AWS with ECR is not fully decided yet; I'm doing my first tests there still.
  • ECS+Fargate is still under testing -- but I will probably need to deploy the code from this PR to do real tests. I was able to test it by calling boto3 manually in production and running a build for test-builds-fargate project. The infrastructure works 🎉
  • The implementation for HTML upload done in builds: pre-built HTML upload as a Version source type #13011 is not included in readthedocs-builder yet.

Improvements / Features

  • Each build/project/organization can spin Docker containers with specific vCPU/RAM.
  • Security -- due all the commands run inside a Docker container without access to anything from our infra/code.
  • We can eventually give users root capabilities if we want.
  • No need to have build-* queues and handle queues backups.
  • Reduce Celery headaches a lot.
  • Easy to test different combinations of build instance types.
  • There is no limit of concurrent builds running at a time in the whole platform.
  • If we want to, we can limit the amount of running builds by queuing submit_build_to_ecs tasks instead of handling them immediately. I don't think this is needed tho.
  • We can deploy new code to readthedocs-builder without the need to create an AMI and perform a full deploy, and it will be taken immediately on new Docker containers.

Rollout

  • Behind a feature flag. We can add a subset of projects to start testing this.
  • We need this PR to be merged and deployed using the current process. There is no too much changes in this PR, most of the code is isolated under readthedocs/projects/tasks/fargate.py file. The rest are new model fields, new settings and path decision depending on the feature flag being enabled or not.
  • We can deploy this PR and perform live changes on projects with the feature flag by updating rel branch on readthedocs-builds for quick tests.

Decisions / ToDo

  • The test suite we currently have is not migrated to the new repository. We will need to think and decide what to do there. Since the builder can be ran locally now, we would be able to test more deeply if we want to.
  • We need to work in the TF code for the infrastructure.
  • What are the next steps here?

How to review this PR

  1. Consider the big picture as a whole.
  2. Take a look at the code finding for something that could break production. Those are important to fix at this point.
  3. Review the architecture.md file and try to load all that information in your mind 😄 . We are changing how all the builds are processed.
  4. Look for inconsistencies in that document and potential breaks in data flows and similar.
  5. Don't look at all at the code from readthedocs-builder. I haven't review it either and I'm not worried about it at this point. I will get back to that later, once we have it tested.

Test this locally

  1. Pull down this PR
  2. Clone readthedocs-builder and build the Docker image
docker build -t builder-dev:latest .
  1. Define RTD_PATH_BUILDER environment variable to where you cloned readthedocs-builder (e.g. /home/username/work/readthedocs-builder, it has to be the full path)
  2. Start local development and add use_fargate_builder to a project.
  3. Trigger a build for that project.
  4. Check the logs for that app as usual.
  5. Check the logs for the Docker container with docker logs -f build-<pk>

humitos added 15 commits June 3, 2026 11:35
…BUILDER

First piece of the Fargate builder migration. Adds the data-model surface
the new code path will need; nothing reads or writes these fields yet.

- Build.task_arn (CharField, max_length=255): set by the new
  submit_build_to_ecs Celery task; consumed by cancel_build to call
  ecs:StopTask. Coexists with the legacy task_id field for the duration
  of the rollout.
- Project.container_cpu_limit (PositiveIntegerField): Fargate CPU units
  (1024 = 1 vCPU). Paired with the existing container_mem_limit /
  container_time_limit; defaults to the system-wide default when unset.
- Feature.USE_FARGATE_BUILDER: per-project feature flag the trigger and
  cancel dispatchers will check to decide between the legacy
  update_docs_task path and the new Fargate path.

Migrations are additive (Safe.before_deploy) so the legacy code keeps
working through the rollout.

See readthedocs-builder/docs/architecture.md for the broader design.
Second piece of the Fargate builder migration. Lands the Celery task that
the trigger dispatcher will route to in Phase 5, plus the settings it
relies on. Nothing calls submit_build_to_ecs yet — that wiring happens
in the next piece.

readthedocs/projects/tasks/fargate.py:
- submit_build_to_ecs(build_pk): sparse-clones the YAML, resolves build.os,
  resolves per-build CPU/memory/time-limit, mints a per-build API key,
  calls ecs:RunTask, and stores the returned task ARN on Build.task_arn.
- _sparse_clone_yaml: ``--filter=blob:none --no-checkout`` then
  ``sparse-checkout`` for the four candidate filenames (with/without dot,
  .yaml/.yml). HTTPS clone with token injection; SSH errors out (deferred).
- _read_build_os: parse the YAML, resolve the ``ubuntu-lts-latest`` alias.
- _resolve_fargate_resources: layer project field -> settings default ->
  RTD_BUILD_MAX_* caps -> snap to a Fargate-supported CPU/memory pair.
- _snap_to_fargate_pair: hardcoded AWS Fargate CPU/memory matrix
  (256/512/1024/2048/4096/8192/16384 vCPU tiers).
- _ecs_run_task: boto3 RunTask with FARGATE_SPOT capacity provider, the
  CPU+memory + command + env overrides, and the awsvpc network config.
  AWS errors wrapped in BuildAppError for consistent handling.
- Feature.USE_FARGATE_BUILDER is enforced defensively at the top of the
  task in case the dispatcher routes here without the flag.

readthedocs/settings/base.py:
- RTD_BUILD_DEFAULT_CPU / MEMORY / TIME_LIMIT (2048 / 8192 / 1800).
- RTD_BUILD_MAX_CPU / MEMORY / TIME_LIMIT system-wide caps.
- RTD_BUILD_TIME_LIMIT_GRACE_SECONDS / KILL_SECONDS for the bash watchdog.
- RTD_BUILDER_REPO / REF (default ``rel``).
- RTD_ECS_CLUSTER / TASK_DEFINITION_FORMAT / SUBNETS / SECURITY_GROUPS /
  ASSIGN_PUBLIC_IP / REGION (production-only; empty placeholders here).

Smoke tests:
- ``python -m py_compile`` on the task and migrations.
- _parse_mem_limit_mb on int, ``8g``, ``512m``, plain digits, None, empty,
  garbage.
- _snap_to_fargate_pair: exact pairs, snap-up within tier, lower bound,
  CPU cap, memory cap within tier, both-cap upper bound.

See readthedocs-builder/docs/architecture.md for the broader design.
…uild

Third (and final) piece of the Fargate builder migration. Phase 4 is
now complete: enabling Feature.USE_FARGATE_BUILDER on a project routes
its builds through ``submit_build_to_ecs`` -> Fargate; everything else
continues to use ``update_docs_task`` as before.

readthedocs/core/utils/__init__.py:
- trigger_build: after prepare_build, branch on
  project.has_feature(Feature.USE_FARGATE_BUILDER). Fargate path
  enqueues submit_build_to_ecs.delay(build_pk=build.pk) instead of
  apply_async on the update_docs_task signature. The signature still
  gets built but is unused on the Fargate path — accepted as minor
  wasted work for the rollout window.
- cancel_build: branch on which task identifier is set on the Build:
    1. task_arn (Fargate dispatched) -> ecs:StopTask with reason.
       Failures are logged but don't propagate (e.g. the task may
       have already exited by the time we cancel).
    2. task_id (legacy Celery) -> app.control.revoke as before.
    3. Neither -> log a notice. submit_build_to_ecs guards against
       this case by checking BUILD_STATE_CANCELLED at the top.
  The branch is on the build's *actual* state, not the project's
  current flag value, so an in-flight build that started before the
  flag was flipped still cancels correctly.

readthedocs/projects/tasks/fargate.py:
- submit_build_to_ecs: at the top, bail out if Build.state ==
  BUILD_STATE_CANCELLED. Covers the race where cancel_build flips the
  state while the Celery task is still queued; we skip API-key
  minting + ecs:RunTask + arn write entirely.

Smoke tests:
- ``python -m py_compile`` on both files.
- grep-audit confirms task_arn / USE_FARGATE_BUILDER / stop_task /
  revoke wiring is in place.

See readthedocs-builder/docs/architecture.md (Phase 4) for the
broader design.
When RTD_DOCKER_COMPOSE is set (the local dev environment),
submit_build_to_ecs spawns a sibling builder container on the host's
docker daemon via docker-py instead of calling ecs:RunTask. Same
interface, same env vars; just a different backend behind
_dispatch_build_task.

Lets us test the whole new build path end-to-end against devthedocs.org
+ rustfs without standing up a real ECS cluster.

readthedocs/settings/docker_compose.py:
- RTD_LOCAL_BUILDER_IMAGE (default: builder-dev:latest). Built once by
  the dev via ``cd ../readthedocs-builder && docker build -t builder-dev:latest .``
- RTD_LOCAL_BUILDER_HOST_PATH (from env var). When set, bind-mounted
  at /opt/builder so the entrypoint skips the GitHub clone; matches the
  dev-run.sh iteration loop. Leave unset to exercise the clone path.

readthedocs/projects/tasks/fargate.py:
- _dispatch_build_task: branches on settings.RTD_DOCKER_COMPOSE. Routes
  to _docker_run_task (dev) or _ecs_run_task (prod).
- _docker_run_task: docker-py client.containers.run with cpu/memory
  constraints (Fargate units -> nano_cpus/mem_limit), the compose
  network (RTD_DOCKER_COMPOSE_NETWORK), the optional host bind-mount,
  detach + auto_remove. Returns ``docker://<container-id>`` pseudo-ARN.
- submit_build_to_ecs: when in docker-compose mode, forward AWS_* +
  S3 bucket env vars (matching what dev-run.sh forwards), set
  RTD_DOCKER_USER=root, and append --storage-from-env to the runner
  command so the runner uses env-var-based AWS creds instead of the
  API's STS endpoint.

readthedocs/core/utils/__init__.py (cancel_build):
- If task_arn starts with ``docker://``, look up the container via
  docker.from_env().containers.get(...).kill(). Failures logged, not
  propagated (matches the ecs:StopTask behaviour).
- Otherwise fall through to the existing ecs:StopTask path.

Setup notes (not in this commit; dev's job):
- Mount /var/run/docker.sock into the celery container.
- Build the builder-dev image once.
- Export RTD_LOCAL_BUILDER_HOST_PATH=/absolute/path/to/readthedocs-builder
  before bringing up docker-compose if you want the bind-mount.

See readthedocs-builder/docs/architecture.md.
Three follow-ups to commit c5c00c9:

1. KeyError for the fargate task on the celery worker. Tasks in
   ``readthedocs/projects/tasks/`` are loaded explicitly from
   ``ProjectsConfig.ready()`` (PEP 420 namespace package, no
   __init__.py, so Celery's autodiscover doesn't pick up submodules on
   its own). Added the import for fargate.py alongside builds / search /
   utils.

2. Drop the ``docker://`` prefix on Build.task_arn. The runtime
   distinction is already captured by ``settings.RTD_DOCKER_COMPOSE``;
   sticking with raw container ids / ECS ARNs keeps the field clean.
   cancel_build now branches on the setting instead of parsing a
   prefix.

3. The runner uses the API's STS endpoint for storage credentials in
   both production and local dev (mirrors production semantics). Drop
   --storage-from-env from the dispatched command and stop forwarding
   AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / RTD_S3_*_BUCKET. We
   still forward AWS_S3_ENDPOINT_URL because boto3 needs to know where
   to point in dev (rustfs at http://storage:9000); credentials and
   bucket names come from /api/v2/build/<id>/credentials/storage.

Rename: RTD_LOCAL_BUILDER_HOST_PATH -> RTDDEV_PATH_BUILDER for
consistency with the existing RTDDEV_* naming convention.

RTD_LOCAL_BUILDER_IMAGE keeps a TODO marking it for removal once the
readthedocs/builder:<os> image matrix exists and we can pick the image
from build.os exactly like production does.
Matches the convention the rest of the codebase uses for web-celery
tasks (audit / telemetry / subscriptions / search / etc. all use
queue="web"). Otherwise the task defaults to the celery queue and gets
picked up by the build worker, which doesn't know about it.
- no YAML file
- invalid clone URL
- etc
@humitos humitos requested a review from a team as a code owner June 4, 2026 15:46
@humitos humitos requested a review from stsewd June 4, 2026 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant