Text-to-SQL Fine-Tune

This project fine-tunes small open Qwen2.5-Coder models for natural-language → SQL and evaluates them with the official Spider / test-suite evaluator. The question it sets out to answer: what actually moves accuracy on small text-to-SQL models? Is it model size, fine-tuning, or how you represent the schema?

Three ablation axes: schema representation, fine-tuning (base/few-shot/LoRA), and model size (1.5B/3B/7B). The size ladder is one axis, not the whole thesis.

Project #2 of an LLM/NLP arc. Project #1 — the Biomedical RAG Agent — was about grounding and retrieval; this one is about fine-tuning and the accuracy/size frontier. The through-line is rigorous, objective evaluation.

Approach

Spider (NL + schema → SQL) → LoRA/QLoRA SFT (unsloth) on 1.5B/3B/7B
    → generate SQL → execute against the DB → execution accuracy
    → base vs fine-tuned vs frontier (via LLMGateway) → accuracy-vs-size curve
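
The "execute against the DB → execution accuracy" step can be sketched as follows. This is a minimal illustration, not the project's code: the reported numbers come from the official Spider / test-suite evaluator, which handles ordering, column permutations, and multiple database instances more carefully. The helper names `run_query` and `execution_match` and the toy `singer` table are assumptions made up for this example.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Run a query and return its rows as a multiset, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Counter ignores row order but keeps duplicate rows distinct.
    return Counter(rows)


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same multiset of rows."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred is not None and pred == gold


# Hypothetical toy database standing in for a Spider SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 45), (3, 'Cy', 27);
    """
)

gold = "SELECT COUNT(*) FROM singer WHERE age > 30"
predictions = [
    "SELECT COUNT(singer_id) FROM singer WHERE age >= 31",  # different text, same result
    "SELECT COUNT(*) FROM singer",                          # runs, wrong result
    "SELEC COUNT(*) FROM singer",                           # does not parse
]
matches = [execution_match(conn, p, gold) for p in predictions]
execution_accuracy = sum(matches) / len(matches)
```

Comparing result sets rather than SQL strings is what lets a differently worded but equivalent query count as correct, which is the point of scoring by execution.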

Results

Full writeup: REPORT.md. Narrative version: blog post.

Size + fine-tuning (full Spider dev, n=1034, execution accuracy, 95% bootstrap CI):

config	tier	execution acc	95% CI
1.5B base	open	0.558	[0.527, 0.587]
1.5B fine-tuned	open	0.629	[0.599, 0.659]
3B base	open	0.691	[0.664, 0.721]
7B base	open	0.793	[0.766, 0.817]
claude-haiku-4-5	frontier	0.794	[0.768, 0.819]
gpt-4.1-mini	frontier	0.811	[0.787, 0.836]
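
For readers unfamiliar with how intervals like these are obtained, here is a minimal sketch of a percentile bootstrap over per-example correctness. It is an assumption that the project uses the percentile variant; the function name `bootstrap_ci`, the resample count, and the seed are illustrative. The outcome vector below is synthetic, built only so that its mean matches the 7B base accuracy in the table; it is not the project's per-example results.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the examples with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lo, hi


# Synthetic stand-in: 820 correct out of 1034 examples (about 0.793).
outcomes = [1] * 820 + [0] * 214
acc, lo, hi = bootstrap_ci(outcomes)
print(f"execution acc {acc:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

At n=1034 the interval is roughly five points wide. Interval width shrinks with the square root of the sample size, so the same procedure on a 100-example slice gives an interval about three times wider.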

Findings:

Size dominates: base accuracy climbs 0.56 → 0.69 → 0.79 from 1.5B to 3B to 7B, with non-overlapping CIs at each step.
Open matches frontier: the open 7B base (0.793) is statistically tied with claude-haiku-4-5 (0.794), and its CI overlaps that of gpt-4.1-mini (0.811). On Spider dev, frontier-tier text-to-SQL runs locally.
Fine-tuning is efficient: a one-epoch LoRA on the 1.5B adds a significant +7 points (0.56 → 0.63, non-overlapping CIs), closing about half the gap to the 2×-larger 3B base.
Schema representation (ablation chart, n=100): adding column types hurt execution accuracy; adding primary/foreign keys helped exact match; the minimal representation already hits ~0.91 on the 7B.
Rigor caught a wrong conclusion: at n=100 the fine-tuning effect looked flat; on full dev with CIs it is clearly positive. Small slices lie, hence the CIs.
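
To make the schema-representation axis concrete, the sketch below serializes one schema three ways: table and column names only, names plus column types, and names plus primary/foreign keys. The exact prompt format used in the ablation is not shown in this README, so the function `serialize_schema`, the mode names, the separator, and the toy `singer`/`concert` schema are all hypothetical.

```python
def serialize_schema(tables, mode="minimal"):
    """Render a schema as prompt text in one of three illustrative styles."""
    if mode not in {"minimal", "types", "keys"}:
        raise ValueError(f"unknown mode: {mode}")
    lines = []
    for table in tables:
        cols = [
            f"{name} {ctype}" if mode == "types" else name
            for name, ctype in table["columns"]
        ]
        line = f"{table['name']}({', '.join(cols)})"
        if mode == "keys":
            # Append key annotations after the column list.
            if table.get("primary_key"):
                line += f" | PK: {table['primary_key']}"
            for col, ref in table.get("foreign_keys", []):
                line += f" | FK: {col} -> {ref}"
        lines.append(line)
    return "\n".join(lines)


# Hypothetical two-table schema in the spirit of a Spider database.
tables = [
    {
        "name": "singer",
        "columns": [("singer_id", "INTEGER"), ("name", "TEXT")],
        "primary_key": "singer_id",
    },
    {
        "name": "concert",
        "columns": [("concert_id", "INTEGER"), ("singer_id", "INTEGER")],
        "primary_key": "concert_id",
        "foreign_keys": [("singer_id", "singer.singer_id")],
    },
]

for mode in ("minimal", "types", "keys"):
    print(f"--- {mode} ---")
    print(serialize_schema(tables, mode))
```

Each variant adds tokens to the prompt, so an ablation over them measures whether the extra information is worth the extra context for a small model.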

Run it

bash scripts/prepare_data.sh                                    # Spider DBs + tables + evaluator (WSL2)
PYTHONPATH=src python scripts/run_ablation.py --n 100 --model Qwen/Qwen2.5-Coder-7B-Instruct
PYTHONPATH=src python scripts/run_finetune.py --model Qwen/Qwen2.5-Coder-1.5B-Instruct
PYTHONPATH=src python scripts/run_ladder.py && PYTHONPATH=src python scripts/plot_ladder.py

Trains locally on an RTX 4090 (24 GB, Ada) under WSL2 Ubuntu (Python 3.12, env via uv). See SPEC.md for milestones and docs/stack_derisk.md for the verified stack.

License

MIT — see LICENSE.

