Skip to content
Open
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
222 changes: 222 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Apache Kyuubi — Agent Instructions

A guide for AI coding agents contributing to Apache Kyuubi. Pairs with `CONTRIBUTING.md`, the [Contributor Guide](https://kyuubi.readthedocs.io/en/master/contributing/code/index.html), and review-derived rules in `docs/design/merged_rules.md` (the source of truth when this file is silent).

## Pre-flight Checks

Before the first edit or test in a session:

1. Run `git remote -v`. An `upstream` (or equivalent) remote must point to `apache/kyuubi`; do not work from a fork-only checkout.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
2. If `<upstream>/master` is stale, `git fetch <upstream> master` before branching or rebasing.
3. Check `git status`. If the tree is dirty, ask the user to stash before any branch switch or release/RAT-style check.
4. Existing PR: resolve the branch via `gh api repos/apache/kyuubi/pulls/<num> --jq '.head.ref'`, then check it out.
5. New work: branch from `<upstream>/master`. Branch names like `kyuubi-NNNN-short-slug` are convention, not policy.
6. Confirm `git config user.email` matches an email on the GitHub account that will open the PR.
7. Keep private/ignored files outside the repo root — RAT scans generated local files even when `.gitignore`d.

## Architecture

Kyuubi is a multi-tenant gateway that fronts pluggable SQL engines (Spark, Flink, Hive, Trino, JDBC). The module split is load-bearing in code review.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated

### Module Layout

- `kyuubi-server/` — gateway process. Must not depend on engine modules under `externals/` or implement engine-side runtime behavior.
- `kyuubi-common/` — shared service abstractions, config registry, session/operation base classes, relocated Hive Thrift RPC.
- `kyuubi-ha/`, `kyuubi-events/`, `kyuubi-metrics/`, `kyuubi-rest-client/`, `kyuubi-zookeeper/` — focused libraries.
- `kyuubi-ctl/` — admin CLI.
- `kyuubi-hive-jdbc/`, `kyuubi-hive-beeline/` — JDBC client and shell.
- `externals/kyuubi-{spark,flink,hive,trino,jdbc,data-agent}-sql-engine/` — engine processes. Must not depend on each other.
- `extensions/{server,spark,flink}/` — opt-in plug-ins.
- `integration-tests/` — cross-module suites.
- `kyuubi-assembly/` — binary distribution.

### Hard Boundaries

- Server must not depend on Kyuubi engine modules or use engine runtime APIs (Spark/Flink internals) to implement engine behavior server-side.
- Protocol/client libraries are not a blanket exception — if one is needed in `kyuubi-server`, keep it limited to wire-protocol or client-model translation and verify the dep tree.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
- Engines must not depend on sibling engines (e.g. no `kyuubi-flink-sql-engine` → `kyuubi-spark-sql-engine`).
- Public APIs are abstract over the cluster manager. Use `killApplication`, not `closeYarnJob`; APIs, classes, and config keys must work for YARN, Kubernetes, and future managers.
- Spark SQL syntax extensions need maintainer agreement, version-aware parser support, and a clear reason they cannot stay upstream-compatible.
- Do not break the Thrift wire protocol. Kyuubi speaks Hive TCLIService through relocated dependencies; Python bindings are patched under `python/scripts/thrift-patches/`. Never remove or renumber wire fields.

### High-Sensitivity Areas

Get reviewer attention before changing:

- `python/scripts/thrift-patches/` — patches against upstream Hive TCLIService schema.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
- `kyuubi-common/.../session/SessionManager`, `.../operation/OperationManager` — lifecycle, concurrency, shutdown.
- `kyuubi-server/.../engine/KubernetesApplicationOperation` — cluster-manager integration; tests must not assume a real cluster.
- `kyuubi-common/.../config/KyuubiConf` — config registry; changes require regenerating `settings.md`.
- `kyuubi-server/.../api/v1/` — public REST surface; add auth checks before exposing.

## Build and Test

Use the bundled Maven wrapper (`build/mvn`).

```
build/mvn -Pfast clean package -DskipTests # local compile, skips tests/style/docs/RAT/downloads
build/mvn clean package -pl kyuubi-common -DskipTests
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
build/mvn clean install # all tests
build/mvn clean install -pl kyuubi-common # one module's tests
build/mvn test -pl kyuubi-server -Dtest=none \
-DwildcardSuites=org.apache.kyuubi.server.api.v1.SessionsResourceSuite
build/mvn test -pl kyuubi-hive-jdbc -Dtest=KyuubiStatementTest -DwildcardSuites=none
```

Integration-style tests that exercise a packaged engine require `build/dist` first.

### Engine profile matrix

| Profile | Notes |
|---|---|
| `-Pspark-3.5` (default), `-Pspark-{3.3,3.4,4.0,4.1,master}` | Spark version |
| `-Pflink-1.20` (default), `-Pflink-{1.17,1.18,1.19}` | Flink version |
| `-Pscala-2.13` | Scala 2.13 (default is 2.12) |
| `-P{spark,flink,hive}-provided` | skip bundled engine downloads |
| `-Pmirror-cdn` | use Apache mirror CDN for engine archives |
| `-Pfast` | skip tests/style/docs/enforcer/RAT/downloads |

When engine code varies across versions, gate source/binary differences by Maven profile and runtime capability differences by feature detection — not by parsing version strings.

### Style, deps, docs

```
dev/reformat # Spotless (+ Python Spotless if `black` is installed); run before every commit
build/dependency.sh [--replace] # detect / update dev/dependencyList drift after dep changes
dev/gen/gen_all_config_docs.sh # regenerate docs/configuration/settings.md after KyuubiConf changes
```

Runtime dep changes may also require `LICENSE-binary` and `NOTICE` updates.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated

## Coding Conventions

### Languages

- Prefer Java for new cross-module business logic. Use Scala for Spark/Flink extension internals, Scala test suites, and APIs already dominated by Scala types.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
- Scala follows the Scalafmt/Scalastyle config; Java follows Spotless/google-java-format. `dev/reformat` enforces both.

### Scala

- `Option[T]`, never `null`.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
- Pattern-match guards (`case X(...) if cond =>`) over nested `if`/`match`.
- `()` on no-arg methods with side effects; omit on pure methods.
- Early return for guard conditions over nested `if-else`.
- Avoid new `SparkSession.active` lookups in production code; pass `SparkSession` explicitly unless a Spark entry point gives no parameterized path.

### Naming

- Names describe purpose, not type. Avoid generic terms like `line`, `clue`, `principal`.
- No abbreviations (`currentUser`, not `currUser`).
- Avoid generic factories like `from(...)` — prefer `fromConfig`, `fromUserInput`.
- Names must reflect scope: a connection-level counter is not `totalRows` if it counts one operation.
- No magic numbers/strings. Use named constants; for enums use `MyEnum.X.toString`.

### Comments

- Explain why, not what. Self-explanatory code needs no comment.
- Comment intentional omissions and surprising decisions.
- TODO/FIXME should link a tracking issue when the work is deferred.
Comment thread
wangzhigang1999 marked this conversation as resolved.

### Error handling and logging

- Throw for unsupported features and invalid configs; do not silently ignore.
- Do not introduce direct `process.destroy()` / `destroyForcibly()` for engine shutdown — notify engines so their shutdown hooks run.
- Catch the specific exception (`IOException`, `JsonProcessingException`), not its parent.
- Log levels: `error` = actionable failure, `warn` = recoverable degradation, `info` = lifecycle, `debug` = diagnostics.
- SLF4J placeholders, not concatenation: `log.info("opened session {}", handle)`.

## Configuration

Configs are `ConfigEntry` instances in `KyuubiConf` or per-engine config classes.

- Server-only entries call `.serverOnly` on the builder; `KyuubiConf` tracks them in `serverOnlyConfEntries` so they are stripped before engine-side propagation.
- `version()` on `ConfigEntry` is the release the key first ships in.
- Namespace matches scope: `kyuubi.operation.*`, `kyuubi.session.*`, `kyuubi.<engine>.*`.
- Provide a sensible default; if none is safe, leave it undefined and document the requirement.
- Initialize config-dependent fields in `initialize(conf)`, not at declaration time.
- Reuse `fallbackConf` to share defaults across frontends; do not copy identical entries.
- After any entry change, run `dev/gen/gen_all_config_docs.sh` and commit the diff.

## Testing

- Tests ship in the same PR as the feature. "Tests will follow" is rejected.
- Place tests in the existing suite for the area; create a new suite only when the domain is genuinely new.
- Use `withSessionConf` from `HiveJDBCTestHelper` for per-test config overrides; do not mutate shared fixtures.
- Scala tests use ScalaTest style — no JUnit `@Test` annotations.
- Java tests follow surrounding module style (JDBC tests use JUnit 4: `Assert.assertEquals`, `Assert.assertThrows`).
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
- A test that still passes when the change under test is reverted is not a test. Verify meaningful failure locally when practical.
- Do not add new `Utils.isTesting` branches in production code; use test config or build properties.
- Benchmarks belong in the Spark benchmark framework, not regular suites.

## PR Workflow

PR title: `[KYUUBI #NNNN][COMPONENT] Short imperative summary`. `NNNN` is the linked issue number; `[COMPONENT]` is optional but encouraged (use existing tags from recent `git log` for precedent). Use `[WIP]` while iterating. Update title and description if scope expands — both flow into the merge commit.

Fill in `.github/PULL_REQUEST_TEMPLATE`:

- **Why are the changes needed?** — motivation. Reviewers read the diff for "what".
Comment thread
wangzhigang1999 marked this conversation as resolved.
- **How was this patch tested?** — concrete commands or manual steps; explain if tests were not added.
- **Was this patch authored or co-authored using generative AI tooling?** — fill the field.

One concern per PR. Refactors go in their own PR; if an unrelated nit catches your eye, open a separate PR or follow-up issue.

### Investigating CI failures

Use scoped `gh run` commands instead of downloading full logs first.

```
gh run list -R apache/kyuubi -b <pr-branch> -L 5
gh run view <run-id> -R apache/kyuubi --log-failed
gh run view <run-id> -R apache/kyuubi -j <job-id> --log-failed
gh run download <run-id> -R apache/kyuubi -n <unit-tests-log-...> # full logs of a failed Kyuubi+Spark job
```

## AI-Assisted Contributions

Allowed under ASF guidance and Kyuubi's PR template.

- Disclose tools via `Generated-by: <tool name and version>` in the PR template field.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this is actually not a good practice; we borrowed it from Spark, which was derived from ASF guidance

When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase “Generated-by: ”.

https://www.apache.org/legal/generative-tooling.html

I think this is a misinterpretation of that, Generated-by: used to be an informal git commit message tag proposed by the Linux kernel community, and never got adopted. It is also misleading - we usually use LLM and Agent for not only code generation, but also code review, code reading, solution exploring, etc., the Generated-by: makes me feel like the whole patch is generated by the AI and I am just a transporter delivering it to GitHub.

Instead, I suggest we update this to adopt the Assisted-by: tag, which has been adopted by the Linux kernel community.

torvalds/linux@78d979d
https://docs.kernel.org/process/coding-assistants.html

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on dropping Generated-by:. The thing I'm not sure about is what to put after it.

I went through all 221 real Assisted-by: trailers in torvalds/linux — formats are pretty messy (claude-opus-4-6 / 4.6 / just Claude...), and around 17% use the [TOOL] slot, almost always for coccinelle or kconfiglint, neither of which we run.

So three ways we could go:

  1. Copy kernel as-is: Assisted-by: <AGENT>:<MODEL> [TOOL1] [TOOL2]. My worry is that "tool" in an AI context easily reads as "agent tool call" (Read/Edit/Bash), and people will start dumping their IDE or formatter in there.

  2. Same thing but drop the tool slot: Assisted-by: <AGENT>:<MODEL>. This is actually what ~80% of real kernel commits look like anyway, so we'd be matching what people actually do, not what the spec says.

  3. Just rename the field, keep the value free-form: Assisted-by: <tool name and version>. Smallest change to the existing PR template.

I lean toward 2. WDYT?

(Whichever we pick I'll update .github/PULL_REQUEST_TEMPLATE in this PR too.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the option 2

- Do not use `Co-Authored-By: <AI tool>`.
- The human author is responsible for the patch and must review every line.
- No AI self-references in source, comments, commit messages, or PR titles — disclosure stays in the PR description.
- Verify license/terms compatibility for generated content.

## Boundaries

### Never

- Push directly to `apache/kyuubi`. All changes go through a PR.
- Force-push to a shared branch. Force-push to your own PR branch is fine when coordinated.
- Break Thrift wire compatibility. Adding optional fields is less risky than removing/renumbering, but any wire change needs reviewer attention.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
- Commit `kyuubi_state_store.db`, logs, generated local configs, IDE files, credentials, or other private artifacts.
- Skip `dev/reformat` before committing.
- Introduce new direct `process.destroy()` / `destroyForcibly()` for engine shutdown.
- Introduce new `SparkSession.active` lookups in production code without a reviewer-approved reason.

### Ask first

- New runtime dependencies, especially with non-ASL2-compatible licenses.
- Public API changes (`kyuubi-common`, REST endpoints, Thrift IDL).
- Cross-engine refactors that touch more than one `externals/*` module.
- New configuration entries that gate user-visible behavior.
- New subsystems or large architectural shifts — discuss on the dev mailing list when scope warrants community design input.
Comment thread
wangzhigang1999 marked this conversation as resolved.
Outdated
- Anything in High-Sensitivity Areas above.

For anything not covered here, see the [Contributor Guide](https://kyuubi.readthedocs.io/en/master/contributing/code/index.html) or ask on the [Apache Kyuubi mailing list](https://kyuubi.apache.org/mailing_lists.html).
Loading