[KYUUBI #7445] Add AGENTS.md to guide AI coding agents#7452
[KYUUBI #7445] Add AGENTS.md to guide AI coding agents#7452wangzhigang1999 wants to merge 11 commits into
Conversation
Verified all commands, file paths, class names, and Maven profiles against the actual repository. Corrected the following: - Thrift IDL is upstream (Hive TCLIService via kyuubi-relocated-thrift); rewrote the wire-format references accordingly. Only Kyuubi-side delta is python/scripts/thrift-patches/. - SessionManager and OperationManager live in kyuubi-common, not kyuubi-server. Corrected high-sensitivity area paths. - Replaced KyuubiBaseConnectionTest (does not exist) with the actual KyuubiStatementTest example. - Replaced PlanOnlyStatementSuite (does not exist) with PlanOnlyOperationSuite. - Replaced the broad "KYUUBI_UPDATE=1 build/mvn clean test" with dev/gen/gen_all_config_docs.sh, which is the scoped wrapper that actually regenerates settings.md. - Rewrote the CI failure investigation section: Kyuubi CI does not emit per-test check-run annotations the way Spark CI does, so the recipe now uses gh run --log-failed and artifact download. - Enumerated the externals/* engine modules explicitly instead of relying on a glob pattern that did not cover kyuubi-trino-engine.
Verified against root pom.xml, kyuubi-common/pom.xml, ConfigBuilder.scala, and live source. Corrections: - Removed externals/kyuubi-chat-engine reference; the CHAT engine was removed in Kyuubi 1.12 (per docs/deployment/migration-guide.md). - Rephrased the engine profile matrix: only Spark and Scala have version profiles in pom.xml; Flink (1.20.3), Hive (3.1.3), and Trino client (411) are pinned to single versions. - Java tests use JUnit 4 (no AssertJ dependency in any pom.xml); corrected the testing section accordingly. - kyuubi-common does not contain in-tree Thrift IDL; described it as hosting relocated Hive Thrift RPC types instead. - Distinguished integration-style tests from unit tests for build/dist prerequisite (per docs/contributing/code/testing.md). - Server-only configs are declared via .serverOnly on the ConfigBuilder; serverOnlyConfEntries is the private registry that tracks them. - Softened the "Never" rules for process.destroy() and SparkSession.active to "do not introduce new usage", since legacy usages exist in Spark plugin entry points that require a no-arg constructor.
Markdown reflows naturally on render; the hard wraps at ~70 chars were a stylistic choice from the initial draft that made the source harder to edit. Code fences, tables, nested list items, and the license header are preserved unchanged.
- Server boundary: rephrase from "no Spark/Flink/Hive/Trino dependency" to "no Kyuubi engine module dependency". The server legitimately uses io.trino:trino-client for the Trino frontend and similar protocol/client libraries are allowed. - Engine profile matrix: Flink also has version profiles (flink-1.17/1.18/1.19/1.20). Updated the table and the surrounding text; Hive and Trino remain pinned. - Spark SQL syntax: Kyuubi extensions already add project-specific clauses (OPTIMIZE ... ZORDER BY). Rephrased from "no SQL syntax that Spark doesn't support" to require maintainer agreement and version-gated parser support for new clauses. - Thrift dependencies: kyuubi-common pulls both kyuubi-relocated-thrift and kyuubi-relocated-hive-service-rpc; spelled both out. - Utils.isTesting: legacy usages remain in production (EngineRef, JdbcProcessBuilder, TrinoProcessBuilder, BatchesResource). Softened from "never" to "do not introduce new ones without reviewer agreement". - git config user.email: removed the "reviewers will block" framing, which is not documented as a hard policy; kept the practical fix-it guidance.
…d process.destroy primary rules - Pre-flight: warn that Apache RAT scans the working tree and only honors .rat-excludes (not .gitignore), so private ignored files outside the repo root are safer. - Scala specifics: rephrase the SparkSession.active rule to match the softened Never-list wording (allow Spark plugin/catalog entry points with no parameterized path). - Error handling: rephrase the process.destroy() rule to "do not introduce direct process.destroy() / destroyForcibly()" for consistency with the Never list.
… Spark/Iceberg style
pan3793
left a comment
There was a problem hiding this comment.
@wangzhigang1999, thank you for drafting this proposal, I went through it and left some comments, some points might need more discussion.
|
|
||
| Allowed under ASF guidance and Kyuubi's PR template. | ||
|
|
||
| - Disclose tools via `Generated-by: <tool name and version>` in the PR template field. |
There was a problem hiding this comment.
I feel this is actually not a good practice; we borrowed it from Spark, which was derived from ASF guidance
When providing contributions authored using generative AI tooling, a recommended practice is for contributors to indicate the tooling used to create the contribution. This should be included as a token in the source control commit message, for example including the phrase “Generated-by: ”.
https://www.apache.org/legal/generative-tooling.html
I think this is a misinterpretation of that, Generated-by: used to be an informal git commit message tag proposed by the Linux kernel community, and never got adopted. It is also misleading - we usually use LLM and Agent for not only code generation, but also code review, code reading, solution exploring, etc., the Generated-by: makes me feel like the whole patch is generated by the AI and I am just a transporter delivering it to GitHub.
Instead, I suggest we update this to adopt the Assisted-by: tag, which has been adopted by the Linux kernel community.
torvalds/linux@78d979d
https://docs.kernel.org/process/coding-assistants.html
There was a problem hiding this comment.
Agreed on dropping Generated-by:. The thing I'm not sure about is what to put after it.
I went through all 221 real Assisted-by: trailers in torvalds/linux — formats are pretty messy (claude-opus-4-6 / 4.6 / just Claude...), and around 17% use the [TOOL] slot, almost always for coccinelle or kconfiglint, neither of which we run.
So three ways we could go:
-
Copy kernel as-is:
Assisted-by: <AGENT>:<MODEL> [TOOL1] [TOOL2]. My worry is that "tool" in an AI context easily reads as "agent tool call" (Read/Edit/Bash), and people will start dumping their IDE or formatter in there. -
Same thing but drop the tool slot:
Assisted-by: <AGENT>:<MODEL>. This is actually what ~80% of real kernel commits look like anyway, so we'd be matching what people actually do, not what the spec says. -
Just rename the field, keep the value free-form:
Assisted-by: <tool name and version>. Smallest change to the existing PR template.
I lean toward 2. WDYT?
(Whichever we pick I'll update .github/PULL_REQUEST_TEMPLATE in this PR too.)
- Use `apache` remote; Kyuubi scripts assume the name - Engine list trails with `etc.`; drop Hive from the example set - Add `-am` to single-module Maven examples - Spell out "dependency" instead of "dep" - Drop the "Prefer Java" stance; both languages are first-class - Soften `Option[T]` rule with perf exception - Remove Python thrift-patches reference - Note Thrift Operation reuse with operation-level config keys - Cite fixed version when referencing upstream issues - Keep PR prose precise and concise - Switch AI disclosure to `Assisted-by: <AGENT>:<MODEL>` - Loosen Java test note to defer to surrounding module style - Mention GitHub Discussions alongside the dev mailing list
f9f67d4 to
d0c685c
Compare
…NT>:<MODEL_ID> - AGENTS.md: pin the example to the canonical model id (`claude-opus-4-7`). - PULL_REQUEST_TEMPLATE: rename the field from `Generated-by:` to `Assisted-by:`, broaden the prompt to cover review/exploration, and align the example.
|
|
||
| Allowed under [ASF generative tooling guidance](https://www.apache.org/legal/generative-tooling.html) and Kyuubi's PR template. AI is used for more than code generation (review, reading, solution exploration), so disclose assistance rather than implying the patch was machine-authored. | ||
|
|
||
| - Disclose assistance via `Assisted-by: <AGENT>:<MODEL_VERSION>` in the PR template field, following the [Linux kernel convention](https://docs.kernel.org/process/coding-assistants.html). Example: `Assisted-by: Claude:claude-opus-4-7`. |
There was a problem hiding this comment.
Thank you for the update. Would you also update the pull request template?
Why are the changes needed?
Resolves #7445. Adds
AGENTS.mdat the repo root — a terse, command-oriented guide for AI coding agents contributing to Kyuubi. ComplementsCONTRIBUTING.md. Content is distilled from review feedback on landed PRs and aligned with sibling Apache projects (Spark, Iceberg, Airflow, DataFusion); the methodology and underlying rule set are linked in #7445.How was this patch tested?
Documentation-only change. Every referenced command, Maven profile, script path, class name, and file path was verified against the current
master. Maven invocations (-Pfast,-DwildcardSuites,-Dtest,dev/reformatvia Spotless) were run on tiny modules to confirm the documented usage actually executes.Was this patch authored or co-authored using generative AI tooling?
Assisted-by: Claude:claude-opus-4-7