10 changes: 10 additions & 0 deletions apps/pyth-data-puller/.env.example
@@ -0,0 +1,10 @@
# ClickHouse credentials (loaded via direnv from 1Password)
HOST=
USER=
PASSWORD=

# S3 configuration
S3_BUCKET=pyth-ch-share-public
S3_REGION=ap-northeast-1
S3_PREFIX=exports/pyth-dump
S3_ROLE_ARN=arn:aws:iam::ACCOUNT_ID:role/YOUR_ROLE_NAME
6 changes: 6 additions & 0 deletions apps/pyth-data-puller/.gitignore
@@ -0,0 +1,6 @@
/logs/
/data/
.env
.env.local
.envrc
/tmp/
84 changes: 84 additions & 0 deletions apps/pyth-data-puller/docs/S3_INDEX_HTML_ISSUE.md
@@ -0,0 +1,84 @@
# S3 index.html Upload Issue

## What's happening

The export pipeline writes CSV files directly from ClickHouse to S3 using
`INSERT INTO FUNCTION s3(...)`. This works reliably: CSVs land in the bucket
without errors.

However, after the CSV export finishes, the script tries to generate an
`index.html` manifest (a nice browsable page listing all exported files with
download links) and upload it to S3 using the same ClickHouse mechanism but
with `RawBLOB` format instead of `CSVWithNames`.

This `RawBLOB` upload consistently fails with `Access Denied` on the
`pyth-ch-share-public` bucket, even though the CSV uploads to the same
bucket and prefix succeed fine.
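
For orientation, the manifest itself is just a static page of links. The
following is a minimal TypeScript sketch of such a generator, not the export
script's actual implementation. The names `ExportedFile`, `escapeHtml`,
`objectUrl` and `buildIndexHtml` are hypothetical, and the URL format assumes
the objects are publicly readable through the virtual-hosted-style S3 endpoint.

```typescript
// Hypothetical shape for one exported file; the real app may track more fields.
interface ExportedFile {
  key: string; // S3 object key, e.g. "exports/pyth-dump/run-1/feed.csv"
  sizeBytes: number;
}

// Escape text before embedding it in HTML content or attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a virtual-hosted-style URL for an object, encoding each path segment.
function objectUrl(bucket: string, region: string, key: string): string {
  const path = key.split("/").map(encodeURIComponent).join("/");
  return `https://${bucket}.s3.${region}.amazonaws.com/${path}`;
}

// Render a bare-bones index page: a title and one download link per file.
function buildIndexHtml(
  title: string,
  bucket: string,
  region: string,
  files: ExportedFile[],
): string {
  const rows = files.map((file) => {
    const name = file.key.split("/").pop() ?? file.key;
    const href = escapeHtml(objectUrl(bucket, region, file.key));
    return `<li><a href="${href}">${escapeHtml(name)}</a> (${file.sizeBytes} bytes)</li>`;
  });
  return [
    "<!doctype html>",
    `<html><head><meta charset="utf-8"><title>${escapeHtml(title)}</title></head>`,
    `<body><h1>${escapeHtml(title)}</h1><ul>`,
    ...rows,
    "</ul></body></html>",
  ].join("\n");
}
```

Keeping the generator a pure function of the file list means the same HTML
can be produced regardless of which mechanism ends up performing the upload.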

## Why it fails

The ClickHouse IAM role (`Clickhouse-pyth-lazer-FullAccessPublicShareS3`)
can write `CSVWithNames` data to S3 but gets a 403 when writing `RawBLOB`.
These two formats use different S3 API call patterns internally in ClickHouse,
and the IAM policy appears to restrict one but not the other.

We tried:
- Adding `SETTINGS s3_truncate_on_insert = 1` to skip the existence check — still 403
- Removing the `headers('Content-Type'='text/html')` parameter — still 403
- Writing to root-level path (not subfolders) — still 403

The likely explanation is that the CSV writes succeed because `CSVWithNames`
with a `SELECT FROM table` query goes through a different ClickHouse code path
than `RawBLOB` with a `SELECT base64Decode(...)` constant expression.

## Why this matters

Without `index.html`, the S3 prefix URL returns an `AccessDenied` XML
error because S3 doesn't serve directory listings. The individual CSV files
are still there and downloadable via their direct URLs, but there's no way
for someone to browse and discover the files from a single link. This makes
it hard to share exports with external clients: you'd have to send them
each CSV URL individually.

## Current workaround

Generation of `index.html` is currently disabled (`GENERATE_INDEX_HTML=0`).
The dashboard shows the S3 prefix URL, but clicking it won't show a file
listing. To share with external users, you need to share the direct CSV file
URLs (visible in the export logs).

## How to fix it

Any one of these would work, roughly ordered by effort:

1. **Fix the IAM policy** (preferred, no code changes) — Ask whoever manages
the AWS account to allow `s3:PutObject` for the ClickHouse role on all
keys under `exports/pyth-dump/`, not just CSV files. The role ARN is in
the `.envrc` config. This is the cleanest fix because ClickHouse handles
everything and no extra tooling is needed on the server.

2. **Install AWS CLI on the server** — Set `GENERATE_INDEX_HTML=1` and
`INDEX_CONTENT_TYPE_FIX_WITH_AWSCLI=1` in the export config. The script
will fall back to `aws s3 cp` for the HTML upload, bypassing ClickHouse
entirely. Requires `aws` CLI configured with credentials that have
`PutObject` access to the bucket.

3. **Upload via the Node.js app using @aws-sdk/client-s3** — Instead of
relying on ClickHouse or the AWS CLI, the Node.js web app could generate
the index.html and upload it directly to S3 using the AWS SDK after the
export script finishes. This would require adding `@aws-sdk/client-s3`
as a dependency and having AWS credentials available on the server (via
IAM instance profile, env vars, or shared credentials file). The app
already knows the exported file list from parsing the script output, so
it has everything needed to generate the manifest.

4. **Use a different bucket** — The original `data_dump` workflow used
`dourolabs-pyth-data-share` (eu-west-2) where uploads worked. If that
bucket is still available, switching to it may resolve the issue.
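
For option 1, the missing permission might be expressed as a policy statement
like the one below. It is written as a TypeScript object for consistency with
the app and is only a sketch: the policy actually attached to the role is not
shown in this repository, so its current contents are an assumption. The
function name `buildPutObjectPolicy` is hypothetical, and the resource ARN is
assembled from the bucket and prefix listed in `.env.example`.

```typescript
// Hypothetical helper: builds an IAM policy document that allows PutObject
// on every key under a prefix, regardless of file extension.
function buildPutObjectPolicy(bucket: string, prefix: string) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:PutObject"],
        // The trailing /* matches all keys under the prefix, not only *.csv.
        Resource: [`arn:aws:s3:::${bucket}/${prefix}/*`],
      },
    ],
  };
}

// Values taken from .env.example.
const examplePolicy = buildPutObjectPolicy(
  "pyth-ch-share-public",
  "exports/pyth-dump",
);
const examplePolicyJson = JSON.stringify(examplePolicy, null, 2);
```

The serialized `examplePolicyJson` is what would be handed to whoever manages
the AWS account for review. If `index.html` is written outside the
`exports/pyth-dump/` prefix, the resource pattern would need widening to match.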

## Re-enabling index.html

Once any of the above is resolved, re-enable by setting
`GENERATE_INDEX_HTML=1` in `src/lib/export-runner.ts` and restoring the
metadata env vars (export name, channel label, feed labels) that were
removed from `buildEnvConfig()` in the same file.
6 changes: 6 additions & 0 deletions apps/pyth-data-puller/next-env.d.ts
@@ -0,0 +1,6 @@
/// <reference types="next" />
/// <reference types="next/image-types/global" />
import "./.next/types/routes.d.ts";

// NOTE: This file should not be edited
// see https://nextjs.org/docs/app/api-reference/config/typescript for more information.
10 changes: 10 additions & 0 deletions apps/pyth-data-puller/next.config.js
@@ -0,0 +1,10 @@
/** @type {import('next').NextConfig} */
const config = {
  reactStrictMode: true,
  output: "standalone",
  serverExternalPackages: ["better-sqlite3"],
  logging: {
    fetches: { fullUrl: true },
  },
};
export default config;
30 changes: 30 additions & 0 deletions apps/pyth-data-puller/package.json
@@ -0,0 +1,30 @@
{
  "name": "@pythnetwork/pyth-data-puller",
  "private": true,
  "type": "module",
  "version": "0.0.0",
  "engines": {
    "node": "^24.0.0"
  },
  "scripts": {
    "build": "next build",
    "start:dev": "next dev --port 3000",
    "start:prod": "next start -H 127.0.0.1 --port 3000",
    "test:types": "tsc"
  },
  "dependencies": {
    "better-sqlite3": "^11.7.0",
    "next": "catalog:",
    "react": "catalog:",
    "react-dom": "catalog:",
    "zod": "catalog:"
  },
  "devDependencies": {
    "@cprussin/tsconfig": "catalog:",
    "@types/better-sqlite3": "^7.6.12",
    "@types/node": "catalog:",
    "@types/react": "catalog:",
    "@types/react-dom": "catalog:",
    "typescript": "catalog:"
  }
}