Builds: isolated builds#13104
Open
humitos wants to merge 15 commits into
Open
Conversation
…BUILDER First piece of the Fargate builder migration. Adds the data-model surface the new code path will need; nothing reads or writes these fields yet. - Build.task_arn (CharField, max_length=255): set by the new submit_build_to_ecs Celery task; consumed by cancel_build to call ecs:StopTask. Coexists with the legacy task_id field for the duration of the rollout. - Project.container_cpu_limit (PositiveIntegerField): Fargate CPU units (1024 = 1 vCPU). Paired with the existing container_mem_limit / container_time_limit; defaults to the system-wide default when unset. - Feature.USE_FARGATE_BUILDER: per-project feature flag the trigger and cancel dispatchers will check to decide between the legacy update_docs_task path and the new Fargate path. Migrations are additive (Safe.before_deploy) so the legacy code keeps working through the rollout. See readthedocs-builder/docs/architecture.md for the broader design.
Second piece of the Fargate builder migration. Lands the Celery task that the trigger dispatcher will route to in Phase 5, plus the settings it relies on. Nothing calls submit_build_to_ecs yet — that wiring happens in the next piece. readthedocs/projects/tasks/fargate.py: - submit_build_to_ecs(build_pk): sparse-clones the YAML, resolves build.os, resolves per-build CPU/memory/time-limit, mints a per-build API key, calls ecs:RunTask, and stores the returned task ARN on Build.task_arn. - _sparse_clone_yaml: ``--filter=blob:none --no-checkout`` then ``sparse-checkout`` for the four candidate filenames (with/without dot, .yaml/.yml). HTTPS clone with token injection; SSH errors out (deferred). - _read_build_os: parse the YAML, resolve the ``ubuntu-lts-latest`` alias. - _resolve_fargate_resources: layer project field -> settings default -> RTD_BUILD_MAX_* caps -> snap to a Fargate-supported CPU/memory pair. - _snap_to_fargate_pair: hardcoded AWS Fargate CPU/memory matrix (256/512/1024/2048/4096/8192/16384 vCPU tiers). - _ecs_run_task: boto3 RunTask with FARGATE_SPOT capacity provider, the CPU+memory + command + env overrides, and the awsvpc network config. AWS errors wrapped in BuildAppError for consistent handling. - Feature.USE_FARGATE_BUILDER is enforced defensively at the top of the task in case the dispatcher routes here without the flag. readthedocs/settings/base.py: - RTD_BUILD_DEFAULT_CPU / MEMORY / TIME_LIMIT (2048 / 8192 / 1800). - RTD_BUILD_MAX_CPU / MEMORY / TIME_LIMIT system-wide caps. - RTD_BUILD_TIME_LIMIT_GRACE_SECONDS / KILL_SECONDS for the bash watchdog. - RTD_BUILDER_REPO / REF (default ``rel``). - RTD_ECS_CLUSTER / TASK_DEFINITION_FORMAT / SUBNETS / SECURITY_GROUPS / ASSIGN_PUBLIC_IP / REGION (production-only; empty placeholders here). Smoke tests: - ``python -m py_compile`` on the task and migrations. - _parse_mem_limit_mb on int, ``8g``, ``512m``, plain digits, None, empty, garbage. - _snap_to_fargate_pair: exact pairs, snap-up within tier, lower bound, CPU cap, memory cap within tier, both-cap upper bound. See readthedocs-builder/docs/architecture.md for the broader design.
…uild
Third (and final) piece of the Fargate builder migration. Phase 4 is
now complete: enabling Feature.USE_FARGATE_BUILDER on a project routes
its builds through ``submit_build_to_ecs`` -> Fargate; everything else
continues to use ``update_docs_task`` as before.
readthedocs/core/utils/__init__.py:
- trigger_build: after prepare_build, branch on
project.has_feature(Feature.USE_FARGATE_BUILDER). Fargate path
enqueues submit_build_to_ecs.delay(build_pk=build.pk) instead of
apply_async on the update_docs_task signature. The signature still
gets built but is unused on the Fargate path — accepted as minor
wasted work for the rollout window.
- cancel_build: branch on which task identifier is set on the Build:
1. task_arn (Fargate dispatched) -> ecs:StopTask with reason.
Failures are logged but don't propagate (e.g. the task may
have already exited by the time we cancel).
2. task_id (legacy Celery) -> app.control.revoke as before.
3. Neither -> log a notice. submit_build_to_ecs guards against
this case by checking BUILD_STATE_CANCELLED at the top.
The branch is on the build's *actual* state, not the project's
current flag value, so an in-flight build that started before the
flag was flipped still cancels correctly.
readthedocs/projects/tasks/fargate.py:
- submit_build_to_ecs: at the top, bail out if Build.state ==
BUILD_STATE_CANCELLED. Covers the race where cancel_build flips the
state while the Celery task is still queued; we skip API-key
minting + ecs:RunTask + arn write entirely.
Smoke tests:
- ``python -m py_compile`` on both files.
- grep-audit confirms task_arn / USE_FARGATE_BUILDER / stop_task /
revoke wiring is in place.
See readthedocs-builder/docs/architecture.md (Phase 4) for the
broader design.
When RTD_DOCKER_COMPOSE is set (the local dev environment), submit_build_to_ecs spawns a sibling builder container on the host's docker daemon via docker-py instead of calling ecs:RunTask. Same interface, same env vars; just a different backend behind _dispatch_build_task. Lets us test the whole new build path end-to-end against devthedocs.org + rustfs without standing up a real ECS cluster. readthedocs/settings/docker_compose.py: - RTD_LOCAL_BUILDER_IMAGE (default: builder-dev:latest). Built once by the dev via ``cd ../readthedocs-builder && docker build -t builder-dev:latest .`` - RTD_LOCAL_BUILDER_HOST_PATH (from env var). When set, bind-mounted at /opt/builder so the entrypoint skips the GitHub clone; matches the dev-run.sh iteration loop. Leave unset to exercise the clone path. readthedocs/projects/tasks/fargate.py: - _dispatch_build_task: branches on settings.RTD_DOCKER_COMPOSE. Routes to _docker_run_task (dev) or _ecs_run_task (prod). - _docker_run_task: docker-py client.containers.run with cpu/memory constraints (Fargate units -> nano_cpus/mem_limit), the compose network (RTD_DOCKER_COMPOSE_NETWORK), the optional host bind-mount, detach + auto_remove. Returns ``docker://<container-id>`` pseudo-ARN. - submit_build_to_ecs: when in docker-compose mode, forward AWS_* + S3 bucket env vars (matching what dev-run.sh forwards), set RTD_DOCKER_USER=root, and append --storage-from-env to the runner command so the runner uses env-var-based AWS creds instead of the API's STS endpoint. readthedocs/core/utils/__init__.py (cancel_build): - If task_arn starts with ``docker://``, look up the container via docker.from_env().containers.get(...).kill(). Failures logged, not propagated (matches the ecs:StopTask behaviour). - Otherwise fall through to the existing ecs:StopTask path. Setup notes (not in this commit; dev's job): - Mount /var/run/docker.sock into the celery container. - Build the builder-dev image once. - Export RTD_LOCAL_BUILDER_HOST_PATH=/absolute/path/to/readthedocs-builder before bringing up docker-compose if you want the bind-mount. See readthedocs-builder/docs/architecture.md.
Three follow-ups to commit c5c00c9: 1. KeyError for the fargate task on the celery worker. Tasks in ``readthedocs/projects/tasks/`` are loaded explicitly from ``ProjectsConfig.ready()`` (PEP 420 namespace package, no __init__.py, so Celery's autodiscover doesn't pick up submodules on its own). Added the import for fargate.py alongside builds / search / utils. 2. Drop the ``docker://`` prefix on Build.task_arn. The runtime distinction is already captured by ``settings.RTD_DOCKER_COMPOSE``; sticking with raw container ids / ECS ARNs keeps the field clean. cancel_build now branches on the setting instead of parsing a prefix. 3. The runner uses the API's STS endpoint for storage credentials in both production and local dev (mirrors production semantics). Drop --storage-from-env from the dispatched command and stop forwarding AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / RTD_S3_*_BUCKET. We still forward AWS_S3_ENDPOINT_URL because boto3 needs to know where to point in dev (rustfs at http://storage:9000); credentials and bucket names come from /api/v2/build/<id>/credentials/storage. Rename: RTD_LOCAL_BUILDER_HOST_PATH -> RTDDEV_PATH_BUILDER for consistency with the existing RTDDEV_* naming convention. RTD_LOCAL_BUILDER_IMAGE keeps a TODO marking it for removal once the readthedocs/builder:<os> image matrix exists and we can pick the image from build.os exactly like production does.
Matches the convention the rest of the codebase uses for web-celery tasks (audit / telemetry / subscriptions / search / etc. all use queue="web"). Otherwise the task defaults to the celery queue and gets picked up by the build worker, which doesn't know about it.
- no YAML file - invalid clone URL - etc
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR allows us to change the architecture of how we run builds in Read the Docs. The architecture document describes in depth how is the implementation for this:
architecture.md. Please, read that document and let me know if you have any concern about it.Notes
./dev-run.shscript in the repository that I used to develop the Python script that builds the docs inside a container. We are not going to use it in production. You can skip it completely since it just development focused../entrypoint.shwhich is the script that runs directly inside the Docker container and handles process signals (SIGTERM, SIGKILL, etc)ECS+Fargate is still under testing -- but I will probably need to deploy the code from this PR to do real tests.I was able to test it by callingboto3manually in production and running a build fortest-builds-fargateproject. The infrastructure works 🎉readthedocs-builderyet.Improvements / Features
rootcapabilities if we want.build-*queues and handle queues backups.submit_build_to_ecstasks instead of handling them immediately. I don't think this is needed tho.readthedocs-builderwithout the need to create an AMI and perform a full deploy, and it will be taken immediately on new Docker containers.Rollout
readthedocs/projects/tasks/fargate.pyfile. The rest are new model fields, new settings and path decision depending on the feature flag being enabled or not.relbranch onreadthedocs-buildsfor quick tests.Decisions / ToDo
How to review this PR
architecture.mdfile and try to load all that information in your mind 😄 . We are changing how all the builds are processed.readthedocs-builder. I haven't review it either and I'm not worried about it at this point. I will get back to that later, once we have it tested.Test this locally
readthedocs-builderand build the Docker imagedocker build -t builder-dev:latest .RTD_PATH_BUILDERenvironment variable to where you clonedreadthedocs-builder(e.g./home/username/work/readthedocs-builder, it has to be the full path)use_fargate_builderto a project.docker logs -f build-<pk>