Changes from 2 commits
20 changes: 20 additions & 0 deletions .claude-plugin/marketplace.json
@@ -0,0 +1,20 @@
{
  "name": "opendataloader-pdf",
  "owner": {
    "name": "OpenDataLoader Project"
  },
  "metadata": {
    "description": "AI-powered PDF extraction guidance and automation",
    "version": "1.0.0"
  },
  "plugins": [
    {
      "name": "odl-pdf-skills",
      "description": "Expert guidance for opendataloader-pdf — environment detection, option recommendations, hybrid mode setup, quality diagnostics, and direct conversion execution",
      "source": "./",
      "skills": [
        "./skills/odl-pdf"
      ]
    }
  ]
}
36 changes: 36 additions & 0 deletions .github/workflows/skill-drift-check.yml
@@ -0,0 +1,36 @@
# skill-drift-check.yml
# Ensures skill references stay in sync with options.json when CLI options change.
# Runs sync-skill-refs.py and fails the check if drift is detected (exit code 1).

name: Skill Drift Check

on:
  push:
    paths:
      - 'options.json'
  pull_request:
    paths:
      - 'options.json'
  workflow_dispatch:

jobs:
  check-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Check skill drift
        run: |
          set +e
          python skills/odl-pdf/scripts/sync-skill-refs.py
          EXIT_CODE=$?
          if [ $EXIT_CODE -ne 0 ]; then
            echo ""
            echo "Drift detected: skill references are out of sync with options.json."
            echo "Update skills/odl-pdf/references/options-matrix.md to match options.json."
            exit 1
          fi

Code scanning / CodeQL warning (Medium; conversation resolved, marked Fixed): Workflow does not contain permissions.

Actions job or workflow does not limit the permissions of the GITHUB_TOKEN. Consider setting an explicit permissions block, using the following as a minimal starting point: {contents: read}
1 change: 0 additions & 1 deletion .gitignore
@@ -75,4 +75,3 @@ logs/
# Configuration files
.claude/settings.local.json
.claude/plans/

12 changes: 12 additions & 0 deletions CLAUDE.md
@@ -21,3 +21,15 @@ Hidden text detection (`--filter-hidden-text`) is **off by default** — it requ
- `./scripts/bench.sh --check-regression` — CI mode with threshold check
- Benchmark code lives in [opendataloader-bench](https://github.com/opendataloader-project/opendataloader-bench)
- Metrics: **NID** (reading order), **TEDS** (table structure), **MHS** (heading structure), **Table Detection F1**, **Speed**

## Agent Skills

`skills/odl-pdf/` contains the public agent skill shipped with this project.

When adding or changing CLI options in Java:
1. Run `npm run sync` (regenerates options.json + Python/Node bindings)
2. Update `skills/odl-pdf/references/options-matrix.md` with the new option
3. CI (`skill-drift-check.yml`) will fail the check if step 2 is missed

The skill is written in English for external users. Do not include internal
team terminology or company-specific policies.
14 changes: 14 additions & 0 deletions CONTRIBUTING.md
@@ -134,5 +134,19 @@ git commit -s -m "your message"

Make sure your Git config contains your real name and email.

## Agent Skills Maintenance

This project ships a built-in agent skill at `skills/odl-pdf/`. When you add
or modify CLI options:

1. Run `npm run sync` as usual
2. Update `skills/odl-pdf/references/options-matrix.md` — add the new option
to the appropriate category with its type, default, and description
3. If the new option has interaction rules with existing options (e.g., requires
another option to be set), document the rule in the "Interaction Rules" section

The CI workflow `skill-drift-check.yml` will flag any mismatch between
`options.json` and `options-matrix.md`.

Thank you again for helping us improve this project! 🙌
If you have any questions, open an issue or join the discussion.
25 changes: 25 additions & 0 deletions README.md
@@ -451,6 +451,31 @@ Existing PDFs (untagged)

[PDF Accessibility Guide](https://opendataloader.org/docs/accessibility-compliance)

## Agent Skills

With the bundled skill installed, your AI coding agent can choose options,
set up hybrid mode, and run quality diagnostics for opendataloader-pdf.

Works with **Claude Code**, **Codex**, **Gemini CLI**, **Cursor**, **VS Code**, and 26+ other platforms via the [agentskills.io](https://agentskills.io) spec.

### What the Skill Does

| Phase | Description |
|-------|-------------|
| **Discover** | Detects your OS, Java, Python, Node.js, and ODL installation |
| **Prescribe** | Recommends optimal install method, options, format, and mode |
| **Execute** | Generates ready-to-run commands or runs conversions directly |
| **Diagnose** | Identifies quality issues and escalates (local → cluster → hybrid) |
| **Optimize** | Tunes batch processing, RAG integration, and performance |

### Install

```bash
npx skills add opendataloader-project/opendataloader-pdf --skill odl-pdf
```

Or use the `/odl-pdf` slash command in Claude Code after installing the plugin.

## Roadmap

| Feature | Timeline | Tier |
166 changes: 166 additions & 0 deletions skills/README.md
@@ -0,0 +1,166 @@
# Agent Skills

opendataloader-pdf ships built-in agent skills that help AI coding assistants use this project effectively. Skills follow the [agentskills.io](https://agentskills.io) specification and work with Claude Code, Codex, Gemini CLI, Cursor, VS Code, and 26+ platforms.

## Directory Structure

```
skills/
├── README.md                  ← You are here
└── odl-pdf/                   ← One skill per directory
    ├── SKILL.md               ← Main skill file (loaded when activated)
    ├── references/            ← Deep-dive docs (loaded on demand)
    │   ├── options-matrix.md
    │   ├── hybrid-guide.md
    │   ├── format-guide.md
    │   ├── installation-matrix.md
    │   └── eval-metrics.md
    ├── scripts/               ← Executable helpers
    │   ├── detect-env.sh
    │   ├── hybrid-health.sh
    │   ├── quick-eval.py
    │   └── sync-skill-refs.py
    └── evals/                 ← Quality test cases
        └── evals.json
```

## How Skills Work

### Progressive Disclosure (3 Levels)

| Level | Content | When Loaded |
|-------|---------|-------------|
| **L1** | `description` field in SKILL.md frontmatter (~100 words) | Always visible to skill router |
| **L2** | SKILL.md body (~400 lines) — persona, workflows, decision trees, gotchas | When skill is activated |
| **L3** | `references/*` files — detailed option matrices, guides, metrics | When the user enters that topic |

This design minimizes token usage. The AI agent only loads what it needs for the current task.
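
As a rough illustration of the L1/L2 split, the sketch below separates SKILL.md frontmatter from its body using plain string handling. The helper name `split_skill` and the sample text are assumptions made for this example; real agent runtimes may parse frontmatter differently, for instance with a full YAML parser.

```python
def split_skill(text: str) -> tuple[str, str]:
    """Split SKILL.md text into (frontmatter, body).

    Hypothetical helper: assumes the file starts with a '---' line and
    that the frontmatter ends at the next '---' line.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter: everything is body
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return frontmatter, body
    return "", text  # unterminated frontmatter: treat it all as body


# Illustrative sample, not the real odl-pdf SKILL.md.
sample = """---
name: my-skill
description: Example skill.
---
# Body

Workflow goes here.
"""

frontmatter, body = split_skill(sample)
print(frontmatter)  # L1: small, always visible to the router
print(len(body.splitlines()), "body lines")  # L2: loaded only on activation
```

Only the frontmatter string would need to stay in context until the skill is activated.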

### Dual-Path Option Reference

Skills must work for **both** source-code users and pip-install users:

- **Built-in summaries** (`references/options-matrix.md`): Always available, even without source code
- **Dynamic reference** (`options.json`): Authoritative source when the source repo is available

SKILL.md instructs the AI: "If `options.json` exists in this project, it is the source of truth. Options in `options.json` not found in `options-matrix.md` are newly added."
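
The fallback rule can be sketched as a small path lookup. This is an illustration only: the function name `pick_option_source` and the `skill_dir` parameter are hypothetical, and the skill itself expresses this rule as an instruction to the agent rather than as code.

```python
from pathlib import Path


def pick_option_source(project_root: Path, skill_dir: Path) -> Path:
    """Return the file to treat as the option reference.

    Hypothetical helper: prefers options.json when the source repo is
    present, otherwise falls back to the summary bundled with the skill.
    """
    dynamic = project_root / "options.json"
    builtin = skill_dir / "references" / "options-matrix.md"
    return dynamic if dynamic.is_file() else builtin
```

A pip-install user has no `options.json` in the working directory, so the lookup resolves to the bundled `options-matrix.md`.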

## Creating a New Skill

### 1. Create the Directory

```
skills/my-skill/
├── SKILL.md
├── references/ (optional)
├── scripts/ (optional)
└── evals/ (optional)
```

### 2. Write SKILL.md

The SKILL.md file has two parts:

**Frontmatter** (YAML between `---` markers):

```yaml
---
name: my-skill
description: >
  One paragraph (~100 words) explaining what this skill does.
  Include trigger keywords so the skill router knows when to activate.
  Include "Do NOT use for:" to prevent false activations.
---
```

**Body** (Markdown):

- Define a persona (who the AI becomes when this skill is active)
- Define a workflow (numbered phases the AI follows)
- Include decision trees for common choices
- List critical gotchas the AI must always warn about
- Reference deeper docs with: "See `references/filename.md` for details"

### 3. Write Evals

Create `evals/evals.json` with test scenarios:

```json
{
  "version": "1.0",
  "skill": "my-skill",
  "evals": [
    {
      "id": "eval-001",
      "scenario": "Description of the user's situation",
      "user_input": "What the user says",
      "expected_recommendations": ["What the AI should recommend"],
      "must_mention": ["Required terms in the response"],
      "must_not_mention": ["Forbidden terms"]
    }
  ]
}
```

### 4. Register in marketplace.json

Add your skill to `.claude-plugin/marketplace.json`:

```json
{
  "plugins": [{
    "skills": ["./skills/odl-pdf", "./skills/my-skill"]
  }]
}
```
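
One way to sanity-check a registration is to confirm that each listed skill directory contains a `SKILL.md`. The sketch below is not part of this project: the function name `missing_skill_files` is hypothetical, and it assumes the `skills` paths resolve relative to the repository root.

```python
import json
from pathlib import Path


def missing_skill_files(marketplace_path: Path, repo_root: Path) -> list[str]:
    """Return registered skill paths that lack a SKILL.md file.

    Hypothetical check: assumes each plugin entry carries a "skills" list
    of paths relative to repo_root, as in the snippet above.
    """
    data = json.loads(marketplace_path.read_text(encoding="utf-8"))
    missing = []
    for plugin in data.get("plugins", []):
        for skill in plugin.get("skills", []):
            if not (repo_root / skill / "SKILL.md").is_file():
                missing.append(skill)
    return missing
```

Calling `missing_skill_files(Path(".claude-plugin/marketplace.json"), Path("."))` from the repository root should return an empty list once the new skill directory exists.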

### 5. Test

Test by spawning an AI agent that knows nothing about the project, loading only your SKILL.md, and asking it the eval scenarios. All `must_mention` terms should appear; no `must_not_mention` terms should appear.
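
The pass/fail rule can be sketched as a simple term check. The grader below is an assumption for illustration: the name `check_response` is hypothetical, and case-insensitive substring matching is one possible interpretation of "appear", not necessarily how this project grades responses.

```python
def check_response(response: str, eval_case: dict) -> list[str]:
    """Return a list of failures for one eval case (empty means pass).

    Hypothetical grader: uses case-insensitive substring matching.
    """
    text = response.lower()
    failures = []
    for term in eval_case.get("must_mention", []):
        if term.lower() not in text:
            failures.append(f"missing required term: {term}")
    for term in eval_case.get("must_not_mention", []):
        if term.lower() in text:
            failures.append(f"found forbidden term: {term}")
    return failures


# Illustrative case; the terms are made up for this example.
case = {
    "id": "eval-001",
    "must_mention": ["Java 11"],
    "must_not_mention": ["download link"],
}
print(check_response("You need Java 11 or newer installed.", case))  # prints []
```

Substring matching is deliberately strict and cheap; it will miss paraphrases, so a human or model-based review may still be needed for borderline responses.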

## Modifying the Existing Skill

### When CLI Options Change

1. Run `npm run sync` (regenerates `options.json`)
2. Update `skills/odl-pdf/references/options-matrix.md` — add the new option to the appropriate category
3. If the option has interaction rules, document them in the "Interaction Rules" section
4. CI (`skill-drift-check.yml`) will catch any mismatch you miss

### When Adding a New Hybrid Backend

1. Update `skills/odl-pdf/references/hybrid-guide.md` — add to the Backend Registry table
2. SKILL.md's decision tree says "check `options.json` for allowed hybrid values" — new backends are auto-discovered

### When Adding a New Output Format

1. Update `skills/odl-pdf/references/format-guide.md` — add to the format table with downstream use mapping
2. The format list in `options.json` is auto-discovered by the skill

## CI Integration

### Drift Check (`skill-drift-check.yml`)

Runs automatically when `options.json` changes. Compares option names in `options.json` against `options-matrix.md` and fails if they diverge.

Run manually:

```bash
python skills/odl-pdf/scripts/sync-skill-refs.py
```
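
The comparison itself might look like the sketch below. It is not the actual `sync-skill-refs.py`: the function name `find_drift` is hypothetical, the structure of `options.json` is not shown here so option names are passed in as a plain set, and the sketch assumes the matrix writes options as long flags such as `--filter-hidden-text`.

```python
import re


def find_drift(option_names: set[str], matrix_text: str) -> tuple[set[str], set[str]]:
    """Compare CLI option names against flags mentioned in the matrix.

    Returns (missing_from_matrix, unknown_in_matrix). Assumes options are
    written in the matrix as long flags, e.g. --filter-hidden-text.
    """
    documented = set(re.findall(r"--([a-z0-9][a-z0-9-]*)", matrix_text))
    return option_names - documented, documented - option_names


# "new-option" is a made-up name used only to show a drift result.
names = {"filter-hidden-text", "new-option"}
matrix = "| `--filter-hidden-text` | off by default |"
missing, unknown = find_drift(names, matrix)
print(sorted(missing), sorted(unknown))
```

A CI wrapper around this would exit with a non-zero status when either set is non-empty, which is what makes the workflow step fail.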

## Writing Guidelines

- **Language**: English only (external open-source users)
- **No internal terminology**: No company names, team names, or internal tool references
- **Tone**: Senior engineer pair-programming — diagnose first, prescribe later
- **Java guidance**: Always mention Java 11+ requirement. Never recommend specific JDK distributions or download links.
- **Gotchas**: Only include gotchas that affect external users. Internal development gotchas belong in CLAUDE.md.

## References

- [agentskills.io specification](https://agentskills.io) — Multi-agent skill format standard
- [Claude Code Skills](https://docs.anthropic.com/en/docs/claude-code) — Claude Code skill documentation
- `.claude-plugin/marketplace.json` — Plugin registration for this project
- `CLAUDE.md` — Internal development notes (not for the skill)
- `CONTRIBUTING.md` — Contributor guidelines including skill maintenance