Open
Commits
33 commits
62c71b9
fix: stabilize douyin api task output and utf8 keywords
lsh-915 Jun 16, 2026
be8fbce
feat(scraper): standardize search outputs in task workspaces
lsh-915 Jun 22, 2026
00a4946
feat(scraper): add raw and cleaned comment outputs
lsh-915 Jun 22, 2026
8bee152
feat(scraper): add cleaned search title outputs
lsh-915 Jun 22, 2026
2d7afaf
feat(scraper): add script source raw and clean outputs
lsh-915 Jun 22, 2026
55199ea
feat(content-asset): add content asset builder
lsh-915 Jun 22, 2026
adcc7f2
feat(api): expose content asset result preview and export
lsh-915 Jun 22, 2026
11722ba
feat(web): establish Vite frontend baseline
lsh-915 Jun 22, 2026
31ceb01
chore(web): align frontend API and dev ports
lsh-915 Jun 22, 2026
f88a0ff
feat(web): add content asset preview and download UX
lsh-915 Jun 22, 2026
1f74686
docs(content-asset): document schema workflow and acceptance
lsh-915 Jun 22, 2026
5f4bce0
docs(tasks): archive T017-5 release isolation notes
lsh-915 Jun 22, 2026
906266b
fix(api): add missing API utility helpers
lsh-915 Jun 22, 2026
98168e9
chore(runtime): align ports and deployment scaffolding
lsh-915 Jun 22, 2026
0443712
chore(webui): refresh embedded frontend build artifacts
lsh-915 Jun 22, 2026
689e2fe
fix(douyin): stabilize CDP comment collection
lsh-915 Jun 22, 2026
da7abe6
chore(asr): add whisper dependency scaffolding
lsh-915 Jun 22, 2026
68b7f5b
test: add regression test harness
lsh-915 Jun 22, 2026
0e85d02
docs: reconcile architecture and operational notes
lsh-915 Jun 22, 2026
135e8af
test(api): add stable API route regression tests
lsh-915 Jun 22, 2026
4d8f2e8
ci: add stable regression workflow
lsh-915 Jun 22, 2026
45cac09
feat(api): list csv and jsonl files in task result
lsh-915 Jun 22, 2026
746a440
docs: add local runtime port notes
lsh-915 Jun 22, 2026
8639786
docs(api): expand operational guide
lsh-915 Jun 22, 2026
99149c4
chore(runtime): add container chromium launch args
lsh-915 Jun 22, 2026
69d27ec
docs(tasks): archive T017-1 CDP fix notes
lsh-915 Jun 22, 2026
c55a03d
Merge remote-tracking branch 'upstream/main' into fix/douyin-api-outp…
lsh-915 Jun 22, 2026
bdb2c80
fix(api): add minimal api key auth and restrict cors
lsh-915 Jun 22, 2026
74ecc28
test(ci): expose full regression baseline
lsh-915 Jun 23, 2026
1562b4a
fix(redis): align default redis configuration
lsh-915 Jun 23, 2026
01ece7d
chore(api): reconcile legacy crawler route contract
lsh-915 Jun 23, 2026
544afd1
fix(web): align health disk capacity fields
lsh-915 Jun 23, 2026
6752746
fix(logging): avoid closed stream errors during test shutdown
lsh-915 Jun 23, 2026
42 changes: 42 additions & 0 deletions .dockerignore
@@ -0,0 +1,42 @@
# Python
.venv/
__pycache__/
*.pyc
*.pyo
*.egg-info/
dist/
build/
*.egg

# Testing
.pytest_cache/
.mypy_cache/
htmlcov/
.coverage

# Project data
data/
workspaces/

# Git
.git/
.gitignore

# Docs
docs/
*.md
!README.md

# Environment
.env
.env.*

# IDE
.vscode/
.idea/
*.swp
*.swo

# Docker
docker-compose*.yml
Dockerfile
32 changes: 28 additions & 4 deletions .env.example
@@ -1,3 +1,23 @@
+# MediaCrawler Port Configuration
+# Container-internal API port remains 8000; host access uses 18080.
+DY_API_HOST=127.0.0.1
+DY_API_PORT=8000
+DY_API_PUBLIC_PORT=18080
+DY_API_BASE_URL=http://localhost:18080
+# Set a long random value to enable REST and WebSocket authentication.
+DY_API_KEY=
+# Local direct-start default; Docker Compose overrides this to 1.
+DY_API_AUTH_REQUIRED=0
+# Defaults to the four local Web/API origins when omitted.
+DY_CORS_ALLOW_ORIGINS=http://localhost:15173,http://127.0.0.1:15173,http://localhost:18080,http://127.0.0.1:18080
+
+DY_CHROME_PORT=19222
+DY_CHROME_CDP_URL=http://localhost:19222
+
+WEB_DEV_PORT=15173
+VITE_API_BASE_URL=http://localhost:18080
+VITE_WS_BASE_URL=ws://localhost:18080
+
# MySQL Configuration
MYSQL_DB_PWD=123456
MYSQL_DB_USER=root
@@ -6,10 +26,14 @@ MYSQL_DB_PORT=3306
MYSQL_DB_NAME=media_crawler

# Redis Configuration
-REDIS_DB_HOST=127.0.0.1
-REDIS_DB_PWD=123456
-REDIS_DB_PORT=6379
-REDIS_DB_NUM=0
+# Local Redis defaults to localhost with no password.
+# Docker Compose overrides REDIS_HOST to the internal service name "redis".
+# For secured deployments, inject REDIS_PASSWORD through a secret/environment
+# and configure the Redis server with the same password. Do not expose port 6379.
+REDIS_HOST=localhost
+REDIS_PORT=6379
+REDIS_PASSWORD=
+REDIS_DB=0

# MongoDB Configuration
MONGODB_HOST=localhost
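
The API variables added to `.env.example` can be read along these lines. This is a minimal sketch, not the project's loader: `ApiSettings`, `load_api_settings`, and `validate` are hypothetical names, and the fail-fast check is an assumption. Only the variable names and the four default origins come from the file above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple

# The four local origins that .env.example lists as the default allow-list.
DEFAULT_ORIGINS: Tuple[str, ...] = (
    "http://localhost:15173",
    "http://127.0.0.1:15173",
    "http://localhost:18080",
    "http://127.0.0.1:18080",
)


@dataclass(frozen=True)
class ApiSettings:  # hypothetical container, not a class from the repository
    api_key: str
    auth_required: bool
    cors_origins: Tuple[str, ...]


def load_api_settings(env: Mapping[str, str] = os.environ) -> ApiSettings:
    """Read the DY_* variables, falling back to the documented defaults."""
    raw_origins = env.get("DY_CORS_ALLOW_ORIGINS", "")
    origins = tuple(o.strip() for o in raw_origins.split(",") if o.strip())
    return ApiSettings(
        api_key=env.get("DY_API_KEY", ""),
        auth_required=env.get("DY_API_AUTH_REQUIRED", "0") == "1",
        cors_origins=origins or DEFAULT_ORIGINS,
    )


def validate(settings: ApiSettings) -> None:
    """Assumed policy: refuse to run when auth is required but no key is set."""
    if settings.auth_required and not settings.api_key:
        raise ValueError("DY_API_AUTH_REQUIRED=1 but DY_API_KEY is empty")
```

Passing a plain dict instead of `os.environ` keeps the parsing testable without touching the real environment.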
63 changes: 63 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,63 @@
name: CI

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  core-tests:
    name: Core regression gate (89 tests)
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends ffmpeg

      - name: Install Python dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-test.txt
          pip install -r api/requirements.txt

      - name: Run stable core regression
        run: pytest douyin_scraper/tests/ api/tests.py -q

  legacy-baseline:
    name: Legacy baseline (known failures, non-blocking)
    runs-on: ubuntu-latest
    continue-on-error: true
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends ffmpeg

      - name: Install Python dependencies
        run: |
          pip install -r requirements.txt
          pip install -r requirements-test.txt
          pip install -r api/requirements.txt

      - name: Run visible legacy baseline
        run: pytest tests/ test/ -q
11 changes: 10 additions & 1 deletion .gitignore
@@ -179,4 +179,13 @@ agent_zone
debug_tools

database/*.db
-.omx/
+.omx/
+
+# Local frontend dependencies and workspace artifacts
+web/node_modules/
+*.tsbuildinfo
+/workspaces/
+state/
+workspace_default/
+.env.local
+.env.*.local
63 changes: 63 additions & 0 deletions Dockerfile
@@ -0,0 +1,63 @@
# MediaCrawler: Douyin keyword batch collection tool
# Simplified build: use the locally built frontend artifacts directly (api/webui/)
# Build the frontend locally: cd web && npm run build

FROM python:3.11-slim

# Install system dependencies (Chromium for headless browsing, Node.js for execjs)
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    chromium \
    chromium-driver \
    nodejs \
    npm \
    libnss3 \
    libnspr4 \
    libdbus-1-3 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libpango-1.0-0 \
    libcairo2 \
    libasound2 \
    fonts-liberation \
    fonts-noto-color-emoji \
    && rm -rf /var/lib/apt/lists/*

# Set Chromium as default browser for Playwright
ENV PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/usr/bin/chromium
ENV CHROME_PATH=/usr/bin/chromium

WORKDIR /app

# Copy dependency files first for cache efficiency
COPY requirements.txt ./requirements.txt
COPY api/requirements.txt ./api-requirements.txt
COPY pyproject.toml ./

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt \
&& pip install --no-cache-dir -r api-requirements.txt

# Copy source code
COPY . .

# Install douyin_scraper package
RUN pip install --no-cache-dir .

# Expose API port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import httpx; httpx.get('http://localhost:8000/health').raise_for_status()" || exit 1

# Start API server (also serves Web UI via StaticFiles at /ui/)
CMD ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8000"]
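
The health check above could also be written as a small standard-library probe that treats any non-2xx answer as unhealthy. This is a sketch under assumptions: `is_healthy` and `exit_code` are hypothetical helpers, not part of the PR, and only the `/health` URL is taken from the Dockerfile.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 10.0) -> bool:
    """Return True only when the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses;
        # refused connections and timeouts also end up here.
        return False


def exit_code(url: str = "http://localhost:8000/health") -> int:
    """Map the probe result to a process exit code (0 healthy, 1 unhealthy)."""
    return 0 if is_healthy(url) else 1
```

A container command could call `sys.exit(exit_code())`, so that a failing endpoint and an unreachable server both mark the container unhealthy.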
41 changes: 41 additions & 0 deletions Makefile
@@ -0,0 +1,41 @@
.PHONY: install-dev test test-core test-baseline test-all test-known-failures test-external test-unit clean

# Install development and test dependencies (requirements-based, no editable install)
install-dev:
	pip install -r requirements.txt
	pip install -r requirements-test.txt
	pip install -r api/requirements.txt

# Stable merge-gate regression suite
test: test-core

test-core:
	pytest douyin_scraper/tests/ -q
	pytest api/tests.py -q

# Legacy baseline; a non-zero exit is expected while known failures remain
test-baseline:
	pytest tests/ test/ -q

# Complete repository suite with external integrations skipped by default
test-all:
	pytest douyin_scraper/tests/ api/tests.py tests/ test/ -q

# Run the tracked T021 known failures without converting them to xfail
test-known-failures:
	pytest tests/ test/ -m known_fail -q

# Opt in to Redis, MongoDB, and real proxy-provider integration tests
test-external:
	MEDIACRAWLER_RUN_EXTERNAL_TESTS=1 pytest test/ -m external -q

# Run unit tests only (skip integration)
test-unit:
	pytest douyin_scraper/tests/ -q

# Clean build artifacts
clean:
	find . -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null || true
	find . -type f -name "*.pyc" -delete 2>/dev/null || true
	find . -type d -name "*.egg-info" -exec rm -rf {} + 2>/dev/null || true
	rm -rf .pytest_cache .mypy_cache htmlcov .coverage
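
The `test-all` and `test-external` targets imply that tests marked `external` are skipped unless `MEDIACRAWLER_RUN_EXTERNAL_TESTS=1` is set. One common way to get that behavior is a `conftest.py` hook like the sketch below. It is an assumption about how the opt-in could work, not the repository's actual conftest, and `external_tests_enabled` is a hypothetical helper.

```python
import os
from typing import Mapping

EXTERNAL_FLAG = "MEDIACRAWLER_RUN_EXTERNAL_TESTS"


def external_tests_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """External suites run only when the opt-in flag is exactly "1" (assumed rule)."""
    return env.get(EXTERNAL_FLAG) == "1"


def pytest_collection_modifyitems(config, items):
    """pytest hook: add a skip marker to `external` tests unless the flag is set."""
    if external_tests_enabled():
        return
    import pytest  # imported lazily so the helper above is usable without pytest

    skip_external = pytest.mark.skip(reason=f"set {EXTERNAL_FLAG}=1 to run")
    for item in items:
        if item.get_closest_marker("external") is not None:
            item.add_marker(skip_external)
```

With a hook of this kind, `make test-all` collects the external tests but reports them as skipped, while `make test-external` runs them.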
20 changes: 16 additions & 4 deletions README.md
@@ -129,7 +129,7 @@ uv run playwright install

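1. **Install the latest Chrome browser** (version >= 144), [download link](https://www.google.com/chrome/)
2. **Enable remote debugging**: enter `chrome://inspect/#remote-debugging` in the Chrome address bar and check **"Allow remote debugging for this browser instance"**
-3. The page shows `Server running at: 127.0.0.1:9222` when it is ready
+3. The page shows `Server running at: 127.0.0.1:19222` when it is ready

> 💡 **Tip**: After the crawler starts, Chrome shows a confirmation dialog; click "Accept". The program waits for the confirmation, so respond within 60 seconds.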
>
@@ -160,14 +160,18 @@ MediaCrawler provides a web-based visual interface, so even without the command line
#### Start the WebUI service

```shell
-# Start the API server (default port 8080)
-uv run uvicorn api.main:app --port 8080 --reload
+# Start the API server (fixed host port 18080; the container-internal port is 8000)
+uv run uvicorn api.main:app --port 18080 --reload

# Or start it as a module
uv run python -m api.main
```

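-After a successful start, open `http://localhost:8080` to access the WebUI.
+After a successful start, open `http://localhost:18080/ui/` to access the WebUI.
+
+If the service has `DY_API_KEY` configured through `.env` (`API_KEY` is also accepted), enter the same key on the WebUI "Settings" page. REST requests use the `X-API-Key` header. Docker Compose requires a key by default; leaving it empty for a local direct start is only suitable for trusted development environments.
+
+By default, CORS only allows localhost/127.0.0.1 origins on the `15173` web development port and the `18080` API/WebUI port. LAN or public deployments should list trusted origins explicitly; `CORS_ALLOW_ORIGINS=*` is not recommended. If the local `.env` already stores a non-empty LLM/API key, rotate it and switch to secret injection. See the [API security documentation](api/README.md#api-key-鉴权) for details.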
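
The `X-API-Key` header described above can be attached from a client roughly as follows. This is a minimal sketch using only the standard library. `build_request` is a hypothetical helper, `change-me` is a placeholder key, and whether `/health` (the path used by the Dockerfile health check) actually requires a key is not stated in the diff.

```python
import urllib.request
from typing import Optional


def build_request(base_url: str, path: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, attaching X-API-Key only when a key is supplied."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    request = urllib.request.Request(url, method="GET")
    if api_key:
        request.add_header("X-API-Key", api_key)
    return request


# Placeholder key for illustration; a real key would come from DY_API_KEY.
request = build_request("http://localhost:18080", "/health", api_key="change-me")
# urllib.request.urlopen(request, timeout=10) would send it to a running server.
```

Keeping request construction separate from sending makes it easy to check the header without a live API.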

#### WebUI features

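@@ -243,6 +247,14 @@ MediaCrawler supports multiple data storage formats, including CSV, JSON, JSONL, Excel

📖 **For detailed usage, see the [Data Storage Guide](docs/data_storage_guide.md)**

+### Content Asset table
+
+T017-5 provides `content_asset.jsonl` and `content_asset.csv`, which aggregate search, title-cleaning, comment, and script data and use status fields to mark the boundary between real comments, real ASR, and fallback values.
+
+- API host address: `http://localhost:18080`
+- Frontend development address: `http://localhost:15173`
+- Full fields, endpoints, and acceptance notes: [Content Asset data dictionary and acceptance notes](docs/CONTENT_ASSET.md)
+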
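
Because `content_asset.jsonl` marks real versus fallback data through status fields, a consumer might tally rows by such a field before trusting them. The sketch below assumes one JSON object per line. The `comment_status` field and its values are hypothetical, since the real schema lives in `docs/CONTENT_ASSET.md` and is not shown in this diff.

```python
import io
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_field(rows: Iterable[dict], field: str) -> Counter:
    """Tally rows by the value of one status-like field ("missing" if absent)."""
    return Counter(str(row.get(field, "missing")) for row in rows)


# Hypothetical rows; a real file would be read with
# open("content_asset.jsonl", encoding="utf-8").
sample = io.StringIO(
    '{"title": "a", "comment_status": "real"}\n'
    '{"title": "b", "comment_status": "fallback"}\n'
    "\n"
    '{"title": "c"}\n'
)
counts = count_by_field(iter_jsonl(sample), "comment_status")
```

Counting first makes it visible how much of a task's output is fallback data before it is used downstream.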

[🚀 MediaCrawlerPro major release 🚀! More features and a better architecture! Open source is hard work; subscriptions are welcome!](https://github.com/MediaCrawlerPro)
