ClickBench Playground#904
alexey-milovidov wants to merge 195 commits into
WIP checkpoint. Lets visitors run SQL against any of the 80+ ClickBench
systems via a single-page UI, each isolated in a per-system Firecracker
microVM.
- server/ aiohttp API: /api/systems, /api/state, /api/query,
/api/admin/provision. Owns the per-system VM lifecycle,
a 1-Hz CPU/disk/host-pressure watchdog, and a batched
ClickHouse-Cloud logging sink (JSONL fallback).
- agent/ stdlib HTTP agent that runs inside each VM and wraps the
system's install/start/load/query scripts.
- images/ scripts to build the base Ubuntu 22.04 rootfs + per-system
rootfs/system-disk pair (200 GB sparse + 16/88 GB sized
for the system's data format).
- web/ vanilla JS SPA — system picker, query box, X-Query-Time /
X-Output-Truncated rendering.
Smoke-tested: base rootfs boots under Firecracker, agent comes up in
~2 s, /health and /stats respond. Agent self-test on the host (no VM)
covers all 4 endpoints including 10 KB output truncation. ClickHouse
provisioning is in flight; see playground/docs/build-progress.md for
the running checkpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A later `umount -lR` on the chroot's /dev was propagating through the shared mount group and tearing down the host's /dev/pts, breaking sshd's PTY allocation. `--make-rslave` keeps mount events flowing *into* the chroot but blocks unmounts from leaking back to the host. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
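As a sketch of that propagation fix (the rootfs path and chroot payload are illustrative):

  mount --rbind /dev "$ROOTFS/dev"
  mount --make-rslave "$ROOTFS/dev"   # host events still propagate in; nothing leaks back out
  chroot "$ROOTFS" /bin/bash -c "apt-get install -y ..."
  umount -lR "$ROOTFS/dev"            # now detaches only the chroot's view of /dev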
A 16 GB guest snapshot.bin compresses to ~2 GB once we
1) stop+start the system daemon (sheds INSERT-time heap arenas,
buffers, fresh allocator pages),
2) echo 3 > drop_caches (turns 3-5 GB of page cache into zero
pages),
3) zstd -T0 -3 --long=27 (parallel, big match window — most of
the savings come from those zero pages).
Restart is skipped for in-process engines where stop/start is a
no-op AND the data lives in the process; wiping it would defeat
the whole point.
The host now keeps snapshot.bin.zst as the canonical artifact and
decompresses on demand right before /snapshot/load. snapshot.bin
itself is deleted after a successful restore + teardown.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
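The shrink-then-compress sequence, as a guest/host shell sketch (file names illustrative; `clickhouse stop`/`start` stands in for the system's own ./stop + ./start):

  # inside the guest, before the snapshot is taken:
  sudo clickhouse stop && sudo clickhouse start   # shed INSERT-time heap, fresh allocator pages
  echo 3 | sudo tee /proc/sys/vm/drop_caches      # page cache -> zero pages

  # on the host, after Firecracker writes snapshot.bin:
  zstd -T0 -3 --long=27 snapshot.bin -o snapshot.bin.zst   # decompress later with -d --long=27
  rm snapshot.bin                                          # .zst is the canonical artifact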
The previous version threw away stdout/stderr from the pre-snapshot stop/start
cycle, so a silent failure (`sudo clickhouse start` failing because the data
dir was still locked by the dying daemon, etc.) left us with a snapshot of a
dead clickhouse-server — restored VMs then returned "Connection refused
(localhost:9000)" on every query, and the only way to recover was to manually
delete the snapshot.
Capture stdout+stderr into the provision log so the failure mode is visible
via GET /provision-log, and refuse to mark PROVISION_DONE if ./check doesn't
recover within the timeout. The host then sees /provision return 500 and
skips the snapshot step entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PROVISION_DONE lives on the rootfs disk (/var/lib/clickbench-agent/),
which persists across VM cold-boots. So on the second provision after
the host deleted the snapshot files, the agent saw PROVISION_DONE
already set and returned "already provisioned" — but the daemon
itself wasn't running (cold boot, no clickhouse-server in systemd),
so the host snapshotted an empty VM and every restored query came back
with "Connection refused (localhost:9000)".
Two fixes:
1. Agent: on every startup, if PROVISION_DONE is set, kick ./start
in a background thread. start is idempotent for the systems that
have a daemon, so it costs nothing when the daemon is already up
(post-restore) and brings it up when the rootfs is being re-used
across a cold reboot.
2. Host: when (re-)provisioning a system with no snapshot, drop the
existing rootfs.ext4 so install/start/load run fresh. The
system.ext4 (which holds ~14 GB of pre-staged dataset) is preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cloud image ships hostname=ubuntu but /etc/hosts only maps 'localhost' to 127.0.0.1. Every sudo invocation inside the VM then tries to reverse-resolve 'ubuntu' against the network — which has no DNS after the snapshot drops internet — and pays the ~2 s resolver timeout. With several sudos per ./query, that's a multi-second floor on every query, visible in the firecracker log as repeated 'sudo: unable to resolve host ubuntu: Name or service not known'. Mapping ubuntu to 127.0.0.1 short-circuits the lookup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
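The fix is a single line baked into the rootfs (sketch):

  # map the cloud image's hostname to loopback so sudo's reverse lookup never touches the network
  echo '127.0.0.1 ubuntu' | sudo tee -a /etc/hosts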
The mid-snapshot checksum-mismatch I attributed to "stopping the
daemon mid-merge" was actually FS corruption: KVM pauses the vcpus
the moment we call /vm Paused, and any ext4 writeback that was in
flight at that instant gets captured by the snapshot as half-flushed.
On restore the page cache references on-disk blocks that never landed,
and the next read sees a torn write.
Fix:
1. Drop the pre-snapshot stop/start. Killing ClickHouse at any
point never corrupts on-disk MergeTree data — only an unflushed
FS can.
2. Add a /sync endpoint to the agent and call it from the host
right before /vm Paused, so all dirty pages have hit virtio-blk
before KVM freezes the vcpus.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
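The ordering from fix 2, sketched with curl (agent address, port, and socket path are illustrative; the /vm and /snapshot/create bodies follow Firecracker's API, with snapshot.bin holding guest RAM as elsewhere in this PR):

  curl -s -X POST "http://$VM_IP:$AGENT_PORT/sync"   # guest runs sync(); dirty pages hit virtio-blk
  curl -s --unix-socket "$FC_SOCK" -X PATCH http://localhost/vm \
       -d '{"state": "Paused"}'                      # KVM freezes the vcpus over a clean FS
  curl -s --unix-socket "$FC_SOCK" -X PUT http://localhost/snapshot/create \
       -d '{"snapshot_type": "Full", "snapshot_path": "vmstate.bin", "mem_file_path": "snapshot.bin"}'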
Now that the host /syncs the FS before pausing the vcpus, the snapshot
captures consistent on-disk state regardless of when the daemon exits
(MergeTree's on-disk format is durable under arbitrary process exit; only an
unflushed *filesystem* corrupts it). So we can shut the daemon down here to
evict its private heap (merge thread arenas, query cache, mark cache,
uncompressed cache, ingest buffers) and snapshot what's left — mostly
zero-fill RAM, which zstd compresses ~300:1.
Restore path is unchanged: _kick_daemon_if_provisioned at agent startup
brings the daemon back up on every cold restore. First query in a restored VM
pays a 1-2 s daemon-start cost instead of carrying 8-12 GB of memory in every
snapshot.
In-process engines (chdb, polars, …) keep all state in RAM and have no daemon
to stop; for them, has_daemon is false and we skip the stop step, falling
back to drop_caches alone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes for the small-snapshot path:
1. Pass init_on_free=1 in the guest kernel cmdline. Linux normally
leaves freed page frames with whatever bytes were last written to
them, so the post-`clickhouse stop` free pool was ~10 GB of stale
daemon heap and Firecracker's snapshot dump compressed only ~3:1.
init_on_free=1 zeros every page as it goes onto the free list, so
the snapshot's RAM region is genuinely zero-filled and zstd hits
~300:1.
2. Add `_ensure_daemon_started` at the top of the agent's /query
handler. After a snapshot restore (taken with the daemon stopped),
the restored memory has no daemon process and `localhost:9000`
refuses connections. The cold-boot `_kick_daemon_if_provisioned`
only fires on actual cold boots, not on snapshot resumes, so we
need an explicit check at query time. Lock-protected so concurrent
/query requests don't try to ./start the daemon twice; idempotent
and free once the daemon is up.
Also dropped the userspace _zero_free_ram hack — init_on_free does
it natively at no userspace cost.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
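Item 1's cmdline change, as a Firecracker boot-source sketch (kernel path and the other boot args are illustrative; init_on_free=1 is the relevant addition):

  curl -s --unix-socket "$FC_SOCK" -X PUT http://localhost/boot-source \
    -d '{"kernel_image_path": "vmlinux",
         "boot_args": "console=ttyS0 ro root=/dev/vda init_on_free=1"}'
  # init_on_free=1: pages are zeroed as they return to the free list, so
  # post-stop free RAM compresses ~300:1 instead of ~3:1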
End-to-end working with a 35 MB snapshot (16 GiB raw, ~470x ratio): SELECT COUNT(*) returns 99997497 cleanly, GROUP BY URL produces the expected top-N without any checksum errors, output truncation caps a 244 KB result at 10 KB with the right header set. Cold path (snapshot restore + daemon start): ~10 s. Warm path (live VM): subsecond on COUNT / MIN-MAX. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four correctness/efficiency fixes:
1. Shared read-only datasets disk. Previously each per-system rootfs embedded
   its own copy of hits.parquet / hits.tsv / hits.csv (14-75 GB each), so the
   catalog needed ~1-2 TB of redundant dataset storage on the host. Build one
   shared datasets.ext4 instead, attach it to every VM read-only at
   LABEL=cbdata, and have the agent copy the bytes the system actually needs
   from /opt/clickbench/datasets into the writable per-system disk at
   provision time only. The agent uses os.copy_file_range so the in-VM copy
   is kernel-side, not bounced through userspace.
2. Golden-disk snapshot/restore. Firecracker's snapshot.bin only saves
   memory; the disk image referenced by the in-memory state is the live file.
   If anything modifies it between snapshots (background merges, log writes,
   /tmp churn), the next /snapshot/load points at the new disk while replaying
   old memory references. We were getting away with this because
   clickhouse-server happens to be tolerant, but it's fragile. Now /snapshot
   also renames the working disks into `*.golden.ext4`, and /restore-snapshot
   clones the goldens back into fresh working copies via `cp --sparse=always`.
   Every restore starts from the exact disk state captured at snapshot time.
3. Bound per-system disk builds and provisions via asyncio.Semaphore
   (PLAYGROUND_BUILD_CONCURRENCY=6, PLAYGROUND_PROVISION_CONCURRENCY=32) so
   kicking off 98 systems at once doesn't thrash the host NVMe or rate-limit
   Ubuntu mirrors.
4. Re-enabled `ursa` in the playground catalog (was incorrectly in the
   _EXTERNAL exclude list; it runs locally).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
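Fix 2's golden-disk discipline, as a shell sketch (file names illustrative):

  # /snapshot: retire the live disks as immutable goldens
  mv rootfs.ext4 rootfs.golden.ext4
  mv system.ext4 system.golden.ext4
  # /restore-snapshot: every boot gets a fresh clone of the captured state
  cp --sparse=always rootfs.golden.ext4 rootfs.ext4
  cp --sparse=always system.golden.ext4 system.ext4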
The previous design copied dataset files from the read-only cbdata mount into
the per-VM writable cbsystem disk on every provision — 14 GB for parquet
systems, 75 GB for tsv/csv. That worked but was redundant: the data is
already on a read-only mount; the only reason we copied was that ClickBench's
load scripts do `sudo mv` and `sudo chown` on the dataset files.
Use overlayfs instead:
  lowerdir = /opt/clickbench/datasets_ro (RO, the shared image)
  upperdir = /opt/clickbench/system_upper (RW per-VM disk with scripts)
  merged at /opt/clickbench/system
The system's load runs at cwd=/opt/clickbench/system. It sees scripts +
dataset files in one tree. When it `mv`s or `chown`s a file from the lower,
overlayfs does a lazy copy-up: only the file's bytes get materialised into
the upper, and only when the script actually mutates it. Most ClickBench load
scripts `rm` the dataset file after INSERT, which becomes a whiteout in the
upper — a few bytes of metadata, not a 75 GB copy.
Saves ~1-2 TB across the catalog on host disk (no per-system copies) *and*
eliminates the per-provision in-VM stage. Only cost: small metadata to
maintain the overlay (kilobytes).
For partitioned parquet, the source files live in
datasets_ro/hits_partitioned/ but the load globs cwd/hits_*.parquet, so the
agent creates symlinks in the upper pointing at the lower — ~100 symlinks, a
few hundred bytes total.
Also: make build-datasets-image.sh idempotent. The 173 GB rsync into
datasets.ext4 only needs to run when the source dir's mtime has changed;
otherwise the cached image is reused.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
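The mount the agent assembles, sketched (the workdir path is an assumption — overlayfs requires one on the same filesystem as upperdir):

  mkdir -p /opt/clickbench/system_upper /opt/clickbench/system_work /opt/clickbench/system
  mount -t overlay overlay \
    -o lowerdir=/opt/clickbench/datasets_ro,upperdir=/opt/clickbench/system_upper,workdir=/opt/clickbench/system_work \
    /opt/clickbench/system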
Two fixes for the parallel-provisioning-98-systems path:
1. The _build_sem and _provision_sem fields were defined but never acquired —
   `provision-all.sh` kicked all 98 provisions at once and they each
   independently spawned build-system-rootfs.sh, which tried to write ~8 GB
   of rootfs base content × 98 in parallel (~780 GB of writes against a
   single NVMe). Disk got saturated and nothing finished. Use
   `async with self._build_sem:` and `async with self._provision_sem:` around
   the heavy phases.
2. build-system-rootfs.sh now clones the base image at block level with
   `cp --sparse=always` and resizes the filesystem to 200 GB in place,
   instead of mkfs.ext4 + mount + rsync-of-base-contents. The block-level
   clone touches only the ~2 GB of non-zero blocks in the base, vs. the
   rsync approach traversing the mounted base and writing every file
   individually. Per-system rootfs build goes from ~30 s to ~3 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the agent created symlinks in the overlay's upper for partitioned parquet (hits_partitioned/* -> upper/hits_*.parquet) because the source directory was nested. That fell apart on clickhouse's load: `mv hits_*.parquet /var/lib/clickhouse/user_files/` moved the symlinks, and the subsequent `chown` followed them through to the read-only datasets disk and got `Read-only file system`. Flatten the dataset image so all 100 partitioned parquet files sit at the root next to hits.parquet / hits.tsv / hits.csv. The overlay then exposes them directly at /opt/clickbench/system as real files, no symlinks involved. clickhouse's `mv` becomes a real copy-up (and the source becomes a whiteout in upper), and the subsequent `chown` operates on a regular file on the rootfs — works. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2 GB cap on the per-VM system disk was a holdover from the in-VM-copy era, when system.ext4 only held scripts + staged data. Once we switched to overlay-with-RO-datasets, system.ext4 also holds the overlay's upperdir + workdir — i.e. every byte the load script writes lands there, including the database's own files. ClickHouse writes ~5 GB of MergeTree parts, DuckDB ~6 GB, Hyper ~10 GB; chown on partitioned parquet copies up another 14 GB. 2 GB was always going to overflow. Match the rootfs at 200 GB (apparent). The file is sparse: truncate reserves the size but allocates no physical blocks, mkfs.ext4 writes ~50 MB of metadata, and the snapshot/restore path uses `cp --sparse=always` so only the bytes the VM actually wrote land on the host disk. Light systems (chdb, sqlite, ...) cost the host near nothing; heavy ones (tidb at ~137 GB, postgres-indexed ~80 GB) fit without hitting ENOSPC mid-load. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each per-system rootfs build was running `e2fsck -fy` on its clone before `resize2fs`. With 98 systems and ~5 s per fsck of a 200 GB sparse file, that's ~8 minutes of pure disk thrash during catalog build — and entirely redundant: the base ext4 is built fresh and never mounted dirty, so the bit-for-bit clone is clean too. Move the single fsck to the end of build-base-rootfs.sh (where it has all the host's I/O to itself) and skip it in the per-system loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The base ext4 used to be built at 8 GB and each per-system rootfs clone ran resize2fs to grow to 200 GB. resize2fs on a 200 GB file is disk-heavy (it has to write group descriptor and bitmap metadata for every additional block group), and we did it 98 times in parallel. Build the base directly at 200 GB sparse with lazy_itable_init=1,lazy_journal_init=1. mkfs writes ~50 MB of superblock + GDT material upfront and defers the rest to lazy background init, so the image file's physical footprint is unchanged from the previous 8 GB layout (~1.8 GB). Per-system clones then need only `cp --sparse=always`: no resize2fs, no e2fsck, ~1 second each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
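The build sequence, sketched (image and directory names illustrative):

  truncate -s 200G base.ext4           # sparse: reserves the size, allocates nothing
  mkfs.ext4 -F -E lazy_itable_init=1,lazy_journal_init=1 base.ext4   # ~50 MB written upfront
  # per-system clone copies only the allocated blocks, ~1 s:
  cp --sparse=always base.ext4 "systems/$name/rootfs.ext4"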
`umount` already syncs the filesystem being unmounted. The host-wide `sync` we were calling first flushes every dirty page on *every* mount — under 98-way parallel builds, each build's sync blocked on every other build's writeback, multiplying the wall-clock cost. Drop them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…olden
When clickhouse's load `mv hits.parquet /var/lib/clickhouse/user_files/` (or
any cross-FS move) copies the 14-75 GB dataset into the writable per-VM disk
and then `rm`'s it after INSERT, ext4 marks those blocks free but the
underlying virtio-blk file still carries the bytes. `cp --sparse=always` on
the golden then preserves them as random data, so the per-system snapshot for
a parquet engine carried a full extra copy of the dataset that the load
already discarded.
Adding `fstrim /opt/clickbench/sysdisk` and `fstrim /` before the host's
snapshot makes the guest issue DISCARD for free blocks; the host loop driver
responds by punching holes in the sparse backing file (Linux loop devices
advertise discard with PUNCH_HOLE since 4.x, which Firecracker's virtio-blk
passes through). The golden then holds only the bytes the engine actually
keeps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
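In the guest, just before the host snapshots:

  fstrim -v /opt/clickbench/sysdisk   # DISCARD free blocks -> loop driver punches holes
  fstrim -v /

On the host, comparing `du --apparent-size` against plain `du` on the backing file shows how much the holes reclaimed.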
Several systems' load scripts do `sudo mv hits_*.parquet /var/lib/<engine>/user_files/` or `sudo cp hits.csv .../extern/` followed by `chown` to the daemon's user. The mv/cp copies 14-75 GB of data the daemon reads once during INSERT and we delete right after — a complete waste of bytes on disk and time on the wire. Replace with `ln -s` + `chown -h` where the daemon's user-files dir is on a different filesystem from the dataset. `chown -h` chowns the symlink itself rather than following into the (often read-only) original; the underlying dataset is mode 644 anyway, so daemon processes can read through the symlink as their own user. Systems updated: clickhouse, clickhouse-tencent, pg_clickhouse, kinetica, oxla, ursa, arc, cockroachdb. Motivated by the ClickBench playground (Firecracker microVM service) where the dataset is mounted read-only and shared across all VMs; the copy step was the dominant cost on parquet/csv-format systems and pulled 14 GB into the per-VM snapshot golden disk unnecessarily. The change is also benign for the regular benchmark — daemons still read the same bytes, just through a symlink. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
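The before/after pattern, sketched for the clickhouse variant (target dir as in the commit; the daemon user name is illustrative):

  # before: materialises 14-75 GB the daemon reads exactly once
  #   sudo mv hits_*.parquet /var/lib/clickhouse/user_files/
  #   sudo chown clickhouse:clickhouse /var/lib/clickhouse/user_files/*.parquet
  # after: zero-copy; -h chowns the link itself, never the RO dataset behind it
  for f in hits_*.parquet; do
    sudo ln -s "$PWD/$f" "/var/lib/clickhouse/user_files/$f"
    sudo chown -h clickhouse:clickhouse "/var/lib/clickhouse/user_files/$f"
  done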
8080 is the default HTTP admin port for cockroach, the spark UI, trino, presto, druid, and a long tail of other JVM-based databases in the catalog. Our in-VM agent was binding it first, so when their ./start ran the daemon failed with "bind: address already in use" and the whole provision came down with a port conflict. Pick 50080 — uncommon enough that no ClickBench engine in the current catalog wants it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Several systems' load scripts call ../lib/download-hits-* — e.g. doris-parquet expects `download-hits-parquet-partitioned <doris_be_dir>` to materialize the dataset in a specific subdirectory of the BE's working tree. Previously we copied the lib tree into /opt/clickbench/system/_lib, but ../lib from the system dir resolves to /opt/clickbench/lib, not /opt/clickbench/system/_lib. Put 4 stub scripts (one per format) at /opt/clickbench/lib in the base rootfs. Each one symlinks from the shared RO dataset mount into the target directory — same interface as upstream's wget-based scripts, but instant and zero-byte-on-disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
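One of the four stubs might look like this (body is a sketch; the interface — target dir as $1 — is from the commit message):

  #!/bin/bash
  # /opt/clickbench/lib/download-hits-parquet-partitioned (sketch)
  set -euo pipefail
  target="${1:-.}"
  mkdir -p "$target"
  # instant and zero-byte-on-disk: symlink from the shared RO dataset mount
  ln -sf /opt/clickbench/datasets/hits_*.parquet "$target/"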
The firecracker-ci kernel is minimal: it boots fine, but Docker fails to start because it lacks iptables/nat, br_netfilter, veth and other modules that Docker needs to set up its bridge network. That killed ~6 Docker-using systems (byconity, cedardb, citus, cloudberry, greenplum) in the parallel provisioning run. Swap in Ubuntu's `linux-image-generic` kernel (the same one Ubuntu ships for cloud KVM guests). It has every Docker-required module plus a much richer driver set, while still booting under Firecracker. Trade-off: it lacks CONFIG_IP_PNP so the kernel's `ip=` boot arg is ignored. Add a tiny clickbench-net.service that parses `ip=` from /proc/cmdline and applies it to eth0 at boot; agent.service waits for it. The same rootfs continues to work with the firecracker-ci kernel (the systemd unit's `ip addr add` is idempotent — kernel-set IPs are already there). Verified: smoke-boot agent answered in 3 s on the new kernel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Ubuntu generic kernel builds overlay, veth, br_netfilter, iptable_nat, nf_conntrack and friends as loadable modules, not built-in. Without /lib/modules/<ver>/ in the rootfs the kernel can't load them at runtime — the immediate symptom was `Failed to mount /opt/clickbench/system` (overlayfs not available) and Docker still failing to start (no br_netfilter/iptable_nat). Drop the linux-modules-7.0.0-15-generic deb into the chroot, `dpkg --unpack` it into the rootfs, run `depmod`, and pre-load the critical modules via /etc/modules-load.d/clickbench.conf so they're ready before any service starts. The image grew from 1.8 to 2.0 GB physical (200 GB apparent) — modules add ~200 MB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dpkg --unpack` records the modules package in dpkg's status DB without configuring it; subsequent `apt-get install` calls inside every per-system VM see an unconfigured package with unmet dependencies and bail with "Unmet dependencies. Try 'apt --fix-broken install'". That broke ~10 systems in the previous parallel run. Switch to `dpkg-deb -x` — extracts the data tarball into the rootfs without touching dpkg's DB. apt sees a normal system with all modules in /lib/modules/, and the kernel can load them at runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
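The module staging now reduces to (deb name and kernel version illustrative):

  dpkg-deb -x "linux-modules-${KVER}-generic.deb" "$ROOTFS"   # extract; dpkg's DB untouched
  depmod -b "$ROOTFS" "$KVER"                                 # rebuild modules.dep offline
  printf '%s\n' overlay veth br_netfilter iptable_nat nf_conntrack \
    > "$ROOTFS/etc/modules-load.d/clickbench.conf"            # preload before services start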
Snapshot of the state after the 10th parallel run. Documents:
- what works end-to-end (microVM lifecycle, shared RO datasets disk,
per-restore disk hygiene, fstrim before snapshot, Ubuntu kernel
with modules)
- bug fixes pushed during the run (port 8080 conflict, mv→ln -s,
download-hits stubs, build/provision semaphores, redundant fsck/
resize2fs/sync removed, clickbench-net.service, kernel module
preload, 200 GB system disk for heavy systems)
- failure categories observed
- what's left for the long tail
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three independent failures observed in the 10th parallel run:
1. The 7 pg_* systems (pg_clickhouse, pg_duckdb*, pg_ducklake,
pg_mooncake) all failed to spawn firecracker with
`Firecracker panicked at main.rs:296: Invalid instance ID:
InvalidChar('_')`. Firecracker's --id rejects underscores. Map
`_` to `-` for the fc id (the system name itself stays intact).
2. duckdb / chdb-dataframe / duckdb-dataframe OOM-killed at 16 GB
("Out of memory: Killed process 578 (duckdb) anon-rss:15926176kB").
DuckDB and chdb hold the full dataset in memory during INSERT;
16 GB just isn't enough for the 100 M row hits set. Bump default
VM memory to 32 GB. KVM allocates lazily, so 98×32 GB on the host
is fine.
3. monetdb's install fails with `$USER: unbound variable`. systemd's
default service env has no USER/LOGNAME. Stamp them as root in
clickbench-agent.service so subprocess.run inherits them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
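Fix 1 is a one-line mapping on the host (sketch; the system name keeps its underscore everywhere else):

  fc_id="${name//_/-}"   # Firecracker --id rejects '_'
  firecracker --id "$fc_id" --api-sock "$FC_SOCK"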
ClickBench: fix elasticsearch load.py bytes/str mix
VM tweaks for the long tail of failures:
- chdb-dataframe / duckdb-dataframe materialize the full hits dataset
in process memory and need >32 GB. Default to 48 GB.
- Druid / Pinot / similar JVM stacks take 5-10 min to come up
(Zookeeper → Coordinator → Broker → Historical, in sequence). The
agent's 300 s check-loop wasn't enough; widen to 900 s.
elasticsearch/load.py: gzip.open in mode='rt' returns str docs, but
bulk_stream yields bytes for ACTION_META_BYTES and str for the doc.
requests.adapters.send() calls sock.sendall() on the mixed iterable
and crashes with `TypeError: a bytes-like object is required, not
'str'`. Open in 'rb' so docs are bytes — matches the rest of the
generator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chdb-dataframe, duckdb-dataframe, polars-dataframe, daft-parquet, daft-parquet-partitioned load the whole hits dataset into a single in-process DataFrame. Observed peak RSS is 80-100 GB on the partitioned parquet set — even though KVM allocates lazily, sustaining that working set for shared use isn't feasible. Disable them in the registry rather than bump RAM for everyone. Revert the default per-VM RAM cap to 16 GB. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
duckdb-memory's load OOM'd at 16 GB anon-rss — it's the same RAM-resident model as duckdb-dataframe/chdb-dataframe, just packaged as its own ClickBench entry. Add to the disabled-systems list. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mongosh routes console.error() through its own log formatter rather than to process.stderr the way Node REPL does, so the elapsed time the eval block was printing never reached the agent's _extract_script_timing(stderr) parser. The UI's Time: column was empty for every mongo query. Wrap the mongosh invocation in shell-side date arithmetic and emit the seconds to stderr ourselves. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous attempt set --memory=256g --memory-swap=-1 --memory-swappiness=100, but on cgroup v2 the swappiness flag is silently discarded and any --memory cap creates a hard cgroup ceiling that the kernel will OOM on regardless of swap. Let Umbra run with no docker memory cgroup and rely on the host kernel + 256 GiB swap drive. Also raise vm.max_map_count to 1048576 — Umbra issues many small mmaps for its memory-mapped storage and a 100M-row COPY blows past the 65530 default well before any OOM-killer fires. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… binary
The trino:455 image ships no /usr/bin/find, so the previous 'find /usr/lib/trino -name "*.jar"' classpath collector silently returned empty and javac failed with 'package com.amazonaws.auth does not exist'. Use a brace-glob over the two specific HDFS-plugin jars (aws-java-sdk-core and hadoop-apache) and match either the legacy 'com.amazonaws_' / 'io.trino.hadoop_' name prefix used by older Trino builds or the bare modern name.
Tested: javac produces S3AnonymousProvider.class against
  /usr/lib/trino/plugin/hive/hdfs/aws-java-sdk-core-1.12.770.jar
  /usr/lib/trino/plugin/hive/hdfs/hadoop-apache-3.3.5-3.jar
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
omnisci/core-os-cpu:v5.10.2 ships with an empty allowed-import-paths, so the load script's COPY hits FROM '/tmp/hits.csv' fails with 'File or directory path "/tmp/hits.csv" is not whitelisted.' Drop an omnisci.conf with [/tmp/] on the allowlist into heavyai-storage before launching the container — the startomnisci wrapper picks it up automatically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tursodb has been panicking partway through .import:
  thread 'main' panicked at core/storage/sqlite3_ondisk.rs:818:5:
  assertion failed: !*syncing.borrow()
  note: run with `RUST_BACKTRACE=1` environment variable ...
The note speaks for itself. Set RUST_BACKTRACE=1 so the panic line in the provision log (and any UI-facing panic from /query) ships with a call stack for the upstream bug report. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Engines like Elasticsearch, Quickwit, Parseable, Druid return raw
JSON for every query, which currently lands in the output pane as
a single 200-char unwrapped line. If the body is a parseable JSON
object or array, re-emit it with 2-space indentation.
Cheap pre-filter (first non-whitespace byte must be '{' or '[')
keeps us from feeding 14 GB count(*) results through JSON.parse.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
SHOW BACKENDS TSV columns:
  1 BackendId      2 IP             3 HeartbeatPort
  4 BePort         5 HttpPort       6 BrpcPort
  7 LastStartTime  8 LastHeartbeat  9 Alive ...
We were inspecting column 10 (SystemDecommissioned), which is always "false" once the BE is registered — so the wait loop in ./start timed out even when the backend was alive and serving. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The output cap was raised to 256 KB (CLICKBENCH_OUTPUT_LIMIT, enforced inside the in-VM agent), but README.md and build-progress.md still named '10 KB' and the host-side config still carried an unused output_limit_bytes field with a 10 * 1024 default. Align the docs to reality and remove the dead config field (plus the _env_bytes helper that only fed it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the per-VM _query_lock. Per-system ./query scripts are already careful with scratch state (use $$ / mktemp; redirect to sockets the daemon owns) and a quick audit shows no remaining fixed /tmp/<name> paths. Engines whose runtime client takes an exclusive file lock (embedded DuckDB on hits.db, ...) will fail one of two concurrent requests with their normal lock error — that's visible to the user, and the right answer at the engine level is server-mode or per-connection databases. /provision keeps its own lock. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The file was a snapshot of what was wired up early in the playground bring-up. The real source of truth is the code + README; everything else here has drifted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A point-in-time write-up of the first parallel-provision run; the playground has moved on (snapshot/restore overhaul, per-VM swap, btime-watcher agent, sysdisk overrides, ...) and the report is no longer accurate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ap.sql
The SQL file used to take freshly-rotated writer/reader passwords +
the writer's IP as substitution parameters, but those statements
were moved into clickhouse_bootstrap.py (which generates the
passwords from a state file). The header comment in the SQL still
listed the three parameters; only {db:Identifier} is left.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The reader user is created in ClickHouse with sha256_hash of the empty string, so clients authenticate with just the username and no password. The Credentials.reader_password field was a permanent empty string fed straight into aiohttp.BasicAuth(_, "") which is equivalent to BasicAuth(_). Remove the field; pass only the user. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
clickhouse-web ATTACHes the hits table to a remote web disk pointed at https://clickhouse-public-datasets.s3.amazonaws.com/web/ — nothing is downloaded during ./load, parts stream on demand at query time, with /dev/shm/clickhouse/ as a local cache. Drop it from the _EXTERNAL exclusion and grant DATALAKE_FILTERED so the SNI-restricted proxy lets the S3 calls through post-snapshot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, kinetica
databend, kinetica already have install/start/check/load/query/stop;
just drop them from the _EXTERNAL exclusion. Both run as self-hosted
binaries / docker images.
firebolt + parquet variants only had run.sh + benchmark.sh (the
monolithic format), so add per-step scripts wrapping the
ghcr.io/firebolt-db/firebolt-core:preview-rc docker image:
install/ docker pull
start/ docker run with memlock 8 GiB + seccomp unconfined; loop
on SELECT 'firebolt-ready' until the engine returns the
sentinel (firebolt-core's HTTP port answers immediately
but returns 'Cluster not yet healthy' at HTTP 200 until
the engine threads have warmed)
check/ SELECT 1
load/ drop+create database clickbench, POST create.sql
(variant-specific: firebolt INSERTs into a managed
table, firebolt-parquet keeps the external table,
firebolt-parquet-partitioned uses the parquet glob)
query/ POST query to /?database=clickbench&output_format=JSON_Compact;
parse .statistics.elapsed for X-Query-Time
stop/ docker container stop
Each benchmark.sh now exports BENCH_DOWNLOAD_SCRIPT so
build-system-rootfs.sh stages hits.parquet (firebolt,
firebolt-parquet) or hits_*.parquet (firebolt-parquet-partitioned)
on the system disk.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…roxy + DNS
Fixes the security advisories from the review pass:
1. aiohttp static handler: drop follow_symlinks=True. GHSA-5h86-8mv2-jq9f was
   a path-traversal in the static handler reachable only when symlinks were
   followed. The repo's web/ tree has no symlinks anyway, so this is pure
   attack-surface reduction.
2. TRUSTED_INTERNET set removed. clickhouse{,-parquet,-parquet-partitioned}
   and chdb{,-parquet,-parquet-partitioned} no longer get unrestricted
   internet at query time — they all run through the SNI-allowlist proxy now.
   A user SQL that asked clickhouse-client to fetch http://169.254.169.254/...
   can no longer reach the EC2 metadata service or any RFC1918 destination;
   only the S3 hosts in sni_proxy.DEFAULT_ALLOW survive.
3. SNI proxy / local DNS resolver bound to internal traffic only. New
   net.setup_host_firewall() installs INPUT rules accepting 8443/8080/53 only
   from the 10.200.0.0/16 TAP CIDR and loopback, then DROP for anything else.
   Called once at server startup. Without these rules the proxy was an open,
   unauthenticated S3 allowlist relay reachable from the public internet.
4. DNS via local resolver, UDP only. enable_filtered_internet now REDIRECTs
   the VM's UDP/53 to the host's local resolver and DROPs TCP/53 outright (no
   big-payload exfiltration channel via port 53). The previous
   ACCEPT-and-forward path is gone; the POSTROUTING MASQUERADE that supported
   it is no longer needed either, since the SNI proxy opens its own outbound
   socket.
5. /api/admin/provision/{name} restricted to loopback callers. It re-runs
   install/start/load — which can take hours per system — so anonymous
   internet callers triggering it would be a trivial DoS and lateral-movement
   risk. Peer-IP check; behind a reverse proxy the proxy itself is the peer,
   which is fine (the proxy is part of the admin trust boundary).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
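Item 3's firewall, as an equivalent iptables sketch (rule order matters — the accepts must precede the drop):

  for svc in "tcp 8443" "tcp 8080" "udp 53" "tcp 53"; do
    set -- $svc
    iptables -A INPUT -p "$1" --dport "$2" -i lo -j ACCEPT              # loopback
    iptables -A INPUT -p "$1" --dport "$2" -s 10.200.0.0/16 -j ACCEPT   # TAP CIDR
    iptables -A INPUT -p "$1" --dport "$2" -j DROP                      # everyone else
  done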
The Config field was advertised as a "concurrent live VMs cap" but nothing in vm_manager / monitor / main ever read it. Drop the dataclass field, the _env_int default, and the README row. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aiohttp:
- Add a startup assertion: aiohttp >= 3.10 (covers GHSA-5h86-8mv2-jq9f
  static-handler path traversal fixed in 3.9.2, and the request-smuggling
  fixes in 3.9.4 / 3.10.x). Already true on the running host (Ubuntu's
  python3-aiohttp ships 3.13.3), but the assertion catches a future install
  on a stale image.
- Add playground/requirements.txt with the pin for pip-based setups.
systemd unit:
- Drop in ProtectSystem=full, ProtectHome=read-only,
  ProtectKernelTunables/Modules/ControlGroups/Clock, PrivateTmp,
  RestrictAddressFamilies, LockPersonality, RestrictRealtime,
  RestrictNamespaces.
- Explicit ReadWritePaths to /opt/clickbench-playground + ~/.cache (Python
  bytecode).
- Comments explain what we DON'T set (NoNewPrivileges / RestrictSUIDSGID
  would break sudo, ProtectSystem=strict would break the privileged children,
  PrivateNetwork / PrivateDevices would break TAP + /dev/kvm).
Rate limiting:
- In-memory per-source-IP sliding-window counters on /api/query and
  /api/warmup: 200 req/min and 3000 req/hour. Returns 429 with Retry-After
  when exceeded. Both endpoints are unauthenticated; bound the damage a
  single bad actor can do (snapshot-restore spam, heavy-query loops).
  X-Forwarded-For is honored for the leftmost hop if a reverse proxy is in
  front.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Honoring XFF without an authenticated reverse proxy in front lets any caller rotate the header value to forge a fresh IP for every request and bypass the bucket entirely. Drop it. If a reverse proxy is added later, that proxy is the trust boundary and its operator should either terminate the rate-limit there or extend this function to honor XFF only when the peer IP is the proxy's address. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DNS / dnsmasq:
- install-firecracker.sh installs and configures dnsmasq on
every non-loopback host address (port 53 UDP/TCP). The host's
systemd-resolved stays put on 127.0.0.53. iptables PREROUTING
REDIRECT for VM UDP/53 lands on a real listener now; before
this commit the host had no resolver bound to 10.200.x.1:53
and every VM DNS lookup just timed out (manifested as
'Not found address of host' from ClickHouse url() calls).
- net.setup_host_firewall hardens further: TCP/53 in INPUT is
loopback-only now (was internal-CIDR + loopback). VMs are
UDP-only for DNS at every layer.
Rate limiter:
- Add a bulk eviction sweep: when _rate_hits grows past 4096
entries, drop IPs whose newest hit is > 1h old (or whose
deque is empty). The previous code only checked for empty
deques, so one-shot IPs with a single in-window timestamp
accumulated forever. Sweep is amortized O(1) per request.
clickhouse-web:
- ClickHouse rejects filesystem-cache paths outside
/var/lib/clickhouse/caches/ (BAD_ARGUMENTS at CREATE TABLE).
Move the cache from /dev/shm/clickhouse to
/var/lib/clickhouse/caches/web. install + create.sql updated
together so the chown lands on the right path.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
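The VM-side DNS plumbing described above, as iptables sketches (the tap+ interface prefix is illustrative):

  # VM UDP/53 -> the host's dnsmasq listener
  iptables -t nat -A PREROUTING -i tap+ -p udp --dport 53 -j REDIRECT --to-ports 53
  # no TCP/53 from VMs at any layer (kills the big-payload exfil channel)
  iptables -A FORWARD -i tap+ -p tcp --dport 53 -j DROP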
ClickHouse rejects any filesystem-cache path outside /var/lib/clickhouse/caches/ at CREATE TABLE time, but we still want the actual bytes in tmpfs — cold queries pull ~1 GB on first run and we'd rather not touch the SSD. Hand the engine a path that satisfies its prefix check (.../caches/web) but is itself a symlink into /dev/shm/clickhouse. ClickHouse only validates the configured string lexically; it doesn't canonicalise the target. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
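The dodge in full (paths from the commit message):

  mkdir -p /dev/shm/clickhouse /var/lib/clickhouse/caches
  ln -s /dev/shm/clickhouse /var/lib/clickhouse/caches/web   # lexical prefix check passes
  # ClickHouse writes through the symlink; the cache bytes land in tmpfs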
config.py:
- New PLAYGROUND_TLS_CERT / PLAYGROUND_TLS_KEY / PLAYGROUND_TLS_PORT
env vars (default port 443). Empty cert path disables TLS.
main.py:
- When both cert+key are set, bind a second TCPSite on tls_port
with an SSLContext loading the cert chain. Plain port stays up
for loopback / behind-a-LB use.
clickbench-playground.service:
- SupplementaryGroups=ssl-cert so the unprivileged ubuntu user
can read /etc/letsencrypt/{live,archive}/.../privkey.pem.
- AmbientCapabilities=CAP_NET_BIND_SERVICE so the python process
can bind 443. Bounding set deliberately left at default — sudo
children still need the full cap set for iptables / ip tuntap.
install-firecracker.sh:
- When PLAYGROUND_TLS_DOMAIN is set, install certbot, acquire the
cert via --standalone (binds 80 briefly for HTTP-01), and drop
in a deploy hook that re-applies ssl-cert group perms on every
renewal so the privkey stays readable.
End-to-end verified:
curl https://clickbench-playground.clickhouse.com/api/state
-> HTTP 200, ssl_verify_result=0, CN matches, Let's Encrypt E8,
valid through 2026-08-12.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
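The install-firecracker.sh flow, sketched (the email variable and hook body are assumptions; certbot's --standalone mode and renewal-hooks/deploy directory are standard):

  certbot certonly --standalone -d "$PLAYGROUND_TLS_DOMAIN" \
    --non-interactive --agree-tos -m "$ADMIN_EMAIL"
  cat > /etc/letsencrypt/renewal-hooks/deploy/ssl-cert-perms <<'EOF'
  #!/bin/sh
  # keep privkey.pem readable by the ssl-cert group after every renewal
  chgrp -R ssl-cert /etc/letsencrypt/live /etc/letsencrypt/archive
  chmod -R g+rX /etc/letsencrypt/live /etc/letsencrypt/archive
  EOF
  chmod +x /etc/letsencrypt/renewal-hooks/deploy/ssl-cert-perms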
scrollbar-gutter: stable keeps space for the vertical scrollbar even when the rail's content fits without scrolling. Without it the rail visibly shrinks as rows finish, briefly pushing the right pane wide enough to trigger a horizontal scrollbar at the page level. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous start used `--rm` plus an anonymous volume mount, which meant the agent's pre-snapshot `./stop` (docker container stop) removed the container and discarded its volume. The snapshot then captured a freshly-started, empty firebolt-core, and every query post-restore returned Database 'clickbench' does not exist or not authorized. Drop --rm, bind-mount the engine data directory to a per-system fb-volume on the sysdisk, and make ./start re-use the existing container if it's already present (`docker start` instead of re-running `docker run`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
polars/server.py stores the scan_parquet LazyFrame in a module-level `hits` variable; /query returns 409 'DataFrame not loaded' when it is None. The agent's pre-snapshot stop+start cycle was wiping that variable: the snapshot captured a freshly-relaunched server, and the first query post-restore failed with the 409. Marking .preserve-state skips the stop+start so the snapshot ships the running server with `hits` already set. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous --silent PUT discarded a non-200 response from /api/v1/logstream/hits, then every subsequent /ingest POST 400'd 'stream not found' — the only visible evidence was 100k+ curl 400 lines that pushed everything else out of the agent's tail-only provision log buffer. Print the response, capture HTTP_CODE, and exit non-zero if it's not 200/201 so the actual cause surfaces. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The kinetica daemon runs inside a docker container with ./kinetica-persist bind-mounted, so a symlink pointing at $PWD/hits.tsv.gz dangles inside the container and the LOAD returns Not_Found: No such file(s) (File(s):hits.tsv.gz) The persist dir and $PWD live on the same overlay filesystem, so the mv is a rename — cheap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
server.py was discarding the eval()'d value and returning only
{"elapsed": ...}; the playground UI then displayed just the timing.
Stringify the result (polars DataFrame/Series/LazyFrame via __str__,
everything else via repr) and pass it back in a "result" field.
query script extracts result -> stdout, elapsed -> stderr.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vm_manager._snapshot_disks now adds a compression pass after the reflink-clone:
1. cp --reflink=always working/* -> golden/* (cheap, as before)
2. zstd -1 -T0 --sparse golden/* -> golden/*.zst
3. unlink the uncompressed golden once the .zst is written
Two-step compress (no `zstd --rm`) so an interrupted run can't lose the only copy of the golden. Trades a 10-30 s restore-time decompression for ~30-60% smaller goldens; on the heaviest VM we have (duckdb-dataframe, 249 GB swap.golden.raw) zstd-1 sampled ~5.5x, so this is roughly the difference between fitting and not fitting the catalog on a 7 TB host.
_restore_disks materializes the working disk from whichever form of golden exists — .zst (decompress, no reflink) or .ext4 / .raw (the legacy reflink path, kept for backwards compatibility with old snapshots). _has_snapshot accepts either form.
Plus a one-shot scripts/compress-goldens.sh that walks the state dir and converts existing uncompressed goldens, so operators don't have to wait for every system to be re-provisioned before the disk savings land.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
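Per disk, the pass reduces to the following (names illustrative; restore decompresses with `zstd -d --sparse` so the working copy stays sparse):

  cp --reflink=always "$working/rootfs.ext4" "$golden/rootfs.golden.ext4"   # cheap CoW clone
  zstd -1 -T0 --sparse "$golden/rootfs.golden.ext4" -o "$golden/rootfs.golden.ext4.zst"
  rm "$golden/rootfs.golden.ext4"   # only after the .zst is fully written — never `zstd --rm`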