Skip to content

Progress-infinitely/luda

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Luda — Tourism Data Agent

A Text-to-SQL / Chat-Data system for the tourism industry. Users ask business questions in natural language; the system parses metrics, dimensions, time ranges, and filters, generates safe SQL queries, and returns structured data with visualization suggestions and natural language summaries.

Dual-Mode Engine

Luda ships with two execution modes, switchable via environment variable:

Mode Description When to use
legacy (default) Local semantic dictionary + SQL template matching. No LLM dependency, 100% deterministic. Baseline, fast iteration, no API key
langgraph 13-node LangGraph workflow with parallel retrieval, semantic filtering, SQL self-correction, and answer summarization. Full AI capability, requires LLM + embedding keys

LangGraph Architecture (13 nodes)

extract_keywords
    ↓
┌──────────────────────────────────────┐
│  recall_column  │  recall_value  │  recall_metric  │  (parallel)
└──────────────────────────────────────┘
    ↓
merge_retrieved_info
    ↓
┌──────────────────────────────────────┐
│  filter_table  │  filter_metric  │  (parallel)
└──────────────────────────────────────┘
    ↓
add_extra_context → generate_sql → validate_sql
    ↓
execute_sql → [pass] → summarize_result → END
            → [fail] → correct_sql → re-execute → summarize_result → END
  • Three-way parallel recall: Vector search (Qdrant) for schema entities + fuzzy text search (ES) for dimension values + metric definition lookup
  • Semantic filtering: LLM prunes recalled tables/metrics by relevance before SQL generation
  • SQL self-correction: On validation failure, the LLM corrects the SQL and retries (up to N iterations)
  • Answer summarization: Raw query results are rewritten into natural language with chart suggestions

Hybrid Retrieval (RAG)

  • Qdrant: Semantic vector recall for table fields, relationships, and business metrics
  • Elasticsearch: Fuzzy and full-text search for dimension values (destinations, merchants, product names)

Tech Stack

Layer Technology
Backend Python 3.12+ · FastAPI (async)
Orchestration LangGraph · LangChain
Database MySQL 8.0
Vector DB Qdrant
Text Search Elasticsearch 8.x
LLM / Embedding OpenAI-compatible API (DeepSeek / DashScope)
Package Manager uv
Frontend Vanilla HTML + JS
Testing pytest

Project Structure

旅答/
├── frontend/                      # Web UI (chat + visualization)
│   ├── index.html
│   ├── app.js
│   └── styles.css
├── docs/
│   ├── architecture.md            # Detailed architecture document
│   ├── api_contract.md            # API / SSE event contract
│   ├── metric_glossary.md         # Business metric definitions
│   ├── demo_questions.md          # Standard test question set
│   ├── frontend_integration.md    # Frontend SSE integration guide
│   └── handoff.md                 # Project handoff documentation
└── data/                          # Backend sub-project (own pyproject.toml)
    ├── app/
    │   ├── agent/                 # Legacy mode SQL template engine + LangGraph graph
    │   ├── api/                   # FastAPI route handlers
    │   ├── prompts/               # LLM prompt templates
    │   ├── scripts/               # Meta knowledge build scripts
    │   ├── services/              # Query service (SSE emission)
    │   └── ...
    ├── generate/                  # Business data generation engine
    ├── seeds/                     # Seed data
    ├── sql/                       # DDL scripts (meta + business)
    ├── tests/                     # Automated tests
    ├── docker/                    # Infrastructure configs (MySQL, Qdrant, ES)
    ├── init_db.py                 # Database initialization
    ├── pyproject.toml
    └── .env.example

data/ is a sub-project with its own pyproject.toml. All uv commands use --project data.

Quick Start

1. Start infrastructure

# MySQL 8 (host port 3307 → container port 3306)
docker run --name luda-mysql \
  -e MYSQL_ROOT_PASSWORD=123321 \
  -p 3307:3306 \
  -d mysql:8.0 \
  --default-authentication-plugin=mysql_native_password \
  --character-set-server=utf8mb4 \
  --collation-server=utf8mb4_unicode_ci

# Qdrant + Elasticsearch (for langgraph mode)
docker compose -f data/docker/docker-compose.retrieval.yaml up -d

2. Configure environment

cp data/.env.example data/.env

Edit data/.env:

# Database
DB_HOST=127.0.0.1
DB_PORT=3307
DB_USER=root
DB_PASSWORD=123321
DB_NAME=travel
APP_PORT=8000

# Agent mode: legacy (default) / langgraph
LUDA_AGENT_MODE=legacy

# Retrieval toggle (default: off; enable for langgraph mode)
RETRIEVAL_ENABLED=false

# LLM config (required for langgraph mode)
LLM_API_KEY=your-llm-key
LLM_BASE_URL=https://api.deepseek.com
LLM_MODEL=deepseek-chat

# Embedding config (required for langgraph mode)
EMBEDDING_API_KEY=your-embedding-key
EMBEDDING_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
EMBEDDING_MODEL=text-embedding-v4

3. Install dependencies & initialize

# Install Python dependencies
uv sync --project data

# Create meta database
docker exec -i luda-mysql mysql -uroot -p123321 < data/sql/meta.sql

# Initialize business database
uv run --project data python data/init_db.py

# Generate business data (Windows PowerShell)
$env:PYTHONPATH="data"
uv run --project data python -m generate.main --profile full

# Build meta knowledge + retrieval indexes (langgraph mode only)
uv run --project data python -m app.scripts.build_meta_knowledge

4. Start backend

# Windows PowerShell
$env:PYTHONPATH="data"
uv run --project data python -m app.main

# macOS / Linux
PYTHONPATH=data uv run --project data python -m app.main

5. Open frontend

Open frontend/index.html in a browser (or serve it via any static file server).

SSE Event Contract

The system streams progress and results via Server-Sent Events:

Event Description Key Fields
meta Request metadata trace_id, created_at, agent_mode
progress Node execution progress step, status, message
keywords Extracted keywords and entities keyword list
retrieval Retrieved tables, columns, metrics, values retrieval results
filtered_tables LLM-pruned table entities filtered list
filtered_metrics LLM-pruned metric entities filtered list
sql Generated SQL statement SQL string
sql_validation EXPLAIN safety check result validation status
sql_correction Self-correction log (on failure) correction details
result Structured tabular data rows, columns
chart Visualization suggestion config chart config
answer Natural language summary summary text
error Exception or safety block error message
done Workflow complete

SQL Safety Guardrails

All generated SQL passes through multi-layer safety checks before execution:

  1. Read-only filter: Only SELECT allowed. Blocks INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE, and multi-statement queries.
  2. System schema protection: Blocks access to MySQL internal schemas (information_schema, etc.).
  3. Row limit enforcement: Rewrites unbounded SELECT * queries to include LIMIT.
  4. EXPLAIN validation: Executes EXPLAIN <SQL> on a read-only connection to verify syntax and catch join errors before returning results.

Testing

# Run agent regression tests
uv run --project data pytest data/tests/test_query_agent.py -v

# Run all backend tests
uv run --project data pytest data/tests -v

About

Natural language → SQL data analysis agent for tourism business. LangGraph 13-node workflow with hybrid retrieval (Qdrant + ES) and SSE streaming.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors