Skip to content

Lawson-Darrow/Text-to-SQL-Finetune

Text-to-SQL Fine-Tune

Fine-tuning small open Qwen2.5-Coder models for natural-language → SQL, evaluated with the official Spider / test-suite evaluator, to answer: what actually moves accuracy on small text-to-SQL models — model size, fine-tuning, or how you represent the schema?

Three ablation axes: schema representation, fine-tuning (base/few-shot/LoRA), and model size (1.5B/3B/7B). The size ladder is one axis, not the whole thesis.

Project #2 of an LLM/NLP arc. Project #1 — the Biomedical RAG Agent — was about grounding and retrieval; this one is about fine-tuning and the accuracy/size frontier. The through-line is rigorous, objective evaluation.

Approach

Spider (NL + schema → SQL) → LoRA/QLoRA SFT (unsloth) on 1.5B/3B/7B
    → generate SQL → execute against the DB → execution accuracy
    → base vs fine-tuned vs frontier (via LLMGateway) → accuracy-vs-size curve

Results

Full writeup: REPORT.md. Narrative version: blog post.

Size + fine-tuning (full Spider dev, n=1034, execution accuracy, 95% bootstrap CI):

accuracy vs size

config tier execution acc 95% CI
1.5B base open 0.558 [0.527, 0.587]
1.5B fine-tuned open 0.629 [0.599, 0.659]
3B base open 0.691 [0.664, 0.721]
7B base open 0.793 [0.766, 0.817]
claude-haiku-4-5 frontier 0.794 [0.768, 0.819]
gpt-4.1-mini frontier 0.811 [0.787, 0.836]

Findings:

  1. Size dominates — 0.56 → 0.69 → 0.79, all CIs non-overlapping.
  2. Open matches frontier — the open 7B (0.79) is statistically tied with Claude-haiku (0.79) and within noise of GPT-4.1-mini (0.81). Frontier-tier text-to-SQL runs locally.
  3. A one-epoch LoRA on the 1.5B is a significant +7 pts (0.56→0.63), closing ~half the gap to a 2×-larger 3B base — the efficiency story.
  4. Schema representation (ablation chart, n=100): adding column types hurt execution; PK/FK keys helped exact-match; minimal already hits ~0.91 on the 7B.
  5. Rigor caught a wrong conclusion: at n=100 the fine-tuning effect looked flat; on full dev with CIs it's clearly positive. Small slices lie — hence the CIs.

Run it

bash scripts/prepare_data.sh                                    # Spider DBs + tables + evaluator (WSL2)
PYTHONPATH=src python scripts/run_ablation.py --n 100 --model Qwen/Qwen2.5-Coder-7B-Instruct
PYTHONPATH=src python scripts/run_finetune.py --model Qwen/Qwen2.5-Coder-1.5B-Instruct
PYTHONPATH=src python scripts/run_ladder.py && PYTHONPATH=src python scripts/plot_ladder.py

Trains locally on an RTX 4090 (24GB, Ada) under WSL2 Ubuntu (Python 3.12, env via uv). See SPEC.md for milestones and docs/stack_derisk.md for the verified stack.

License

MIT — see LICENSE.

About

Fine-tuning small open models (Qwen2.5-Coder) for text-to-SQL; execution-accuracy eval across a size ladder vs a frontier baseline.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors