Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions TD-xxx-try-election-with-applied
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
改进目前raft的选举实现,加入对appliedIndex的判断,appliedIndex大的更优先被选举为leader。不要替我创建git分支。
36 changes: 36 additions & 0 deletions docs/001-raft-appliedindex-election/checklists/requirements.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# Specification Quality Checklist: Raft Election with Applied Progress Priority

**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2026-04-10
**Feature**: [spec.md](../spec.md)

## Content Quality

- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed

## Requirement Completeness

- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified

## Feature Readiness

- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification

## Notes

- Validation iteration 1: all checklist items passed.
- No clarification questions required.
- Branch creation step was intentionally skipped per user instruction; artifacts were created directly under the sequential feature directory.
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# Contract: RequestVote Applied-Index Preference

## Interface

- Interface type: Internal node-to-node sync RPC contract
- Producer: `syncNodeRequestVotePeers`
- Consumer: `syncNodeOnRequestVote`
- Message: `SyncRequestVote`

## Request Contract

Required fields for vote evaluation:

| Field | Meaning | Requirement |
|------|---------|-------------|
| `term` | Candidate term | Must equal sender current term |
| `lastLogTerm` | Candidate last log term | Must be populated from local sync log state |
| `lastLogIndex` | Candidate last log index | Must be populated from local sync log state |
| `candidateAppliedIndex` | Candidate applied progress snapshot | Must be populated from `FpAppliedIndexCb` at request-build time |

## Vote Evaluation Rules

1. Reject requests from nodes not in the raft group.
2. Update local term and step down if the request term is higher, preserving existing behavior.
3. Apply the existing log-recency gate based on `lastLogTerm` and `lastLogIndex`.
4. If the candidate fails the log-recency gate, reject the vote regardless of applied progress.
5. If the candidate passes the log-recency gate, compare `candidateAppliedIndex` with the receiver's local applied index.
6. Prefer granting votes to the candidate with greater or equal applied progress.
7. If either side reports an unavailable applied index, treat applied progress as unavailable for ordering and fall back to existing deterministic behavior.
8. If applied progress is equal, fall back to existing deterministic behavior.
9. Reset the election timer only when the vote is actually granted, preserving current liveness behavior.

## Observability Contract

- Vote-decision logs must include both log-recency inputs and applied-progress inputs.
- Logs for unequal applied indexes must reveal whether the applied-progress comparison influenced the grant result.
- Election outcome logs must allow operators to explain why the elected node was preferred.

## Compatibility Assumption

- This contract change assumes homogeneous cluster binaries for the feature rollout.
- Mixed-version compatibility is not included in this feature plan.
- Operational rollout guidance MUST document same-version deployment as a prerequisite.

Check failure on line 43 in docs/001-raft-appliedindex-election/contracts/request-vote-applied-index.md

View workflow job for this annotation

GitHub Actions / check-with-markdownlint

Files should end with a single newline character

docs/001-raft-appliedindex-election/contracts/request-vote-applied-index.md:43:87 MD047/single-trailing-newline Files should end with a single newline character https://github.com/DavidAnson/markdownlint/blob/v0.38.0/doc/md047.md
61 changes: 61 additions & 0 deletions docs/001-raft-appliedindex-election/data-model.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
# Data Model: Raft Election with Applied Progress Priority

## Candidate Vote Request

- Purpose: Internal request-vote message emitted by a candidate to each voting peer during an election round.
- Fields:
- `srcId`: Candidate raft identity.
- `destId`: Target voter identity.
- `term`: Candidate term for the election round.
- `lastLogIndex`: Candidate last log index used by the existing recency gate.
- `lastLogTerm`: Candidate last log term used by the existing recency gate.
- `candidateAppliedIndex`: Candidate applied progress snapshot used for preference ordering.
- Validation rules:
- `term` must equal the candidate's current persisted term at send time.
- `candidateAppliedIndex` must be derived from the FSM callback at the time the request is built.
- Message consumers must tolerate equal applied-index values and preserve deterministic tie fallback.

## Local Applied Progress Snapshot

- Purpose: The receiving node's current view of its own applied progress at vote-evaluation time.
- Fields:
- `localAppliedIndex`: Value returned by `FpAppliedIndexCb`.
- `localLastLogIndex`: Last local log index.
- `localLastLogTerm`: Last local log term.
- Validation rules:
- `localLastLogTerm` and `localLastLogIndex` remain the first-stage election safety check.
- `localAppliedIndex` is compared only after the candidate satisfies the safety gate.

## Election Decision Record

- Purpose: Structured evidence for why a vote was granted or rejected and why a leader was elected.
- Fields:
- `term`
- `candidateId`
- `voterId`
- `candidateLastLogTerm`
- `candidateLastLogIndex`
- `candidateAppliedIndex`
- `localLastLogTerm`
- `localLastLogIndex`
- `localAppliedIndex`
- `grantDecision`
- `decisionReason`
- Validation rules:
- Every unequal-applied-index vote decision must be explainable from logged decision factors.
- Logging must not remove or obscure the existing recency decision context.

## Election Round

- Purpose: One leader-election attempt for a vgroup term.
- Fields:
- `term`
- `candidateSet`
- `voteResponses`
- `winner`
- `winnerAppliedIndex`
- `tieFallbackUsed`
- State transitions:
- `Follower -> Candidate`: election starts and self-vote recorded.
- `Candidate -> Leader`: quorum obtained, possibly influenced by applied-progress preference.
- `Candidate/Follower -> Follower`: step-down on higher term or vote for another candidate.

Check failure on line 61 in docs/001-raft-appliedindex-election/data-model.md

View workflow job for this annotation

GitHub Actions / check-with-markdownlint

Files should end with a single newline character

docs/001-raft-appliedindex-election/data-model.md:61:93 MD047/single-trailing-newline Files should end with a single newline character https://github.com/DavidAnson/markdownlint/blob/v0.38.0/doc/md047.md
87 changes: 87 additions & 0 deletions docs/001-raft-appliedindex-election/plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Implementation Plan: Raft Election with Applied Progress Priority

**Branch**: `[001-raft-appliedindex-election]` | **Date**: 2026-04-10 | **Spec**: `/root/github/taosdata/TDinternal/specs/001-raft-appliedindex-election/spec.md`
**Input**: Feature specification from `/specs/001-raft-appliedindex-election/spec.md`

**Note**: Planning was bound to `SPECIFY_FEATURE=001-raft-appliedindex-election` because the user explicitly requested no branch creation.

## Summary

Improve TDinternal's Raft leader election so that eligible candidates with higher applied progress are preferred, while preserving existing last-log-term/last-log-index safety checks and election liveness. The implementation will extend the internal `SyncRequestVote` contract to carry the candidate's applied index, incorporate an applied-progress comparison into vote granting after existing log recency validation, and add observability plus unit/integration coverage for unequal-progress, tie, and lagging-node election scenarios.

## Technical Context

**Language/Version**: C for sync runtime, C++ for existing Google Test coverage
**Primary Dependencies**: TDinternal sync library, internal RPC message layer, WAL/log store, FSM callbacks, Google Test
**Storage**: Raft store metadata, WAL-backed log store, FSM-reported applied progress
**Testing**: Existing `community/source/libs/sync/test` Google Test executables plus targeted cluster-style sync tests
**Target Platform**: Linux server nodes running TDinternal sync/raft services
**Project Type**: Native distributed systems library
**Performance Goals**: Keep vote evaluation O(1), add no extra election round-trips, and avoid measurable regression in election completion under current replica counts
**Constraints**: Preserve Raft log safety, preserve deterministic behavior on ties, keep elections live when applied progress is unavailable/equal, and treat mixed-version wire compatibility as out of scope unless separately planned
**Scale/Scope**: Per-vgroup replica elections across small quorum-based replica sets with optional learners; change scope is limited to sync election logic, internal message schema, and sync tests

## Constitution Check

*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*

- Constitution file review result: `.specify/memory/constitution.md` still contains template placeholders and no ratified enforceable principles.
- Gate status before research: PASS by absence of enforceable constitutional constraints.
- Required caution despite missing constitution rules: preserve existing Raft safety invariants, keep plan scoped to sync election paths, and do not assume branch creation is available.
- Post-design re-check: PASS. The design stays within the sync library boundary, preserves current safety gating on log recency, and adds only one internal message-contract change plus corresponding tests.

## Project Structure

### Documentation (this feature)

```text
specs/001-raft-appliedindex-election/
├── plan.md
├── research.md
├── data-model.md
├── quickstart.md
├── contracts/
│ └── request-vote-applied-index.md
└── tasks.md
```

### Source Code (repository root)

```text
community/source/libs/sync/
├── inc/
│ ├── syncElection.h
│ ├── syncInt.h
│ ├── syncMessage.h
│ ├── syncRequestVote.h
│ └── syncUtil.h
├── src/
│ ├── syncElection.c
│ ├── syncMain.c
│ ├── syncMessage.c
│ ├── syncRequestVote.c
│ └── syncUtil.c
└── test/
├── syncElectTest.cpp
├── syncRequestVoteTest.cpp
├── syncVotesGrantedTest.cpp
└── sync_test_lib/

community/tests/
└── pytest/
└── cluster/
└── syncingTest.py

community/contrib/test/traft/
├── cluster/
├── join_into_vgroup/
└── single_node/
```

**Structure Decision**: Keep all runtime changes inside `community/source/libs/sync/{inc,src}` and validate behavior primarily through `community/source/libs/sync/test`. Use existing cluster-style test scaffolding under `community/contrib/test/traft` or `community/tests/pytest/cluster` only if unit-level coverage cannot express a required election scenario.

## Complexity Tracking

| Violation | Why Needed | Simpler Alternative Rejected Because |
|-----------|------------|-------------------------------------|
| Internal wire contract change | `SyncRequestVote` must carry candidate applied progress so remote voters can compare candidates using data they do not currently receive | Deriving candidate applied progress locally is impossible with current message payloads |
51 changes: 51 additions & 0 deletions docs/001-raft-appliedindex-election/quickstart.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Quickstart: Raft Election with Applied Progress Priority

## 1. Build targeted sync tests

From the repository root:

```sh
cmake --build build --target syncRequestVoteTest syncVotesGrantedTest syncElectTest
```

## 2. Implement the runtime change

Update the sync runtime in these areas:

- Add the applied-index field to `SyncRequestVote` in `community/source/libs/sync/inc/syncMessage.h`.
- Populate the field in `community/source/libs/sync/src/syncElection.c` when building vote requests.
- Read local applied progress from `FpAppliedIndexCb` in `community/source/libs/sync/src/syncRequestVote.c` and incorporate it after the existing log-recency gate.
- Extend request-vote logging in `community/source/libs/sync/src/syncUtil.c` or adjacent sync logging paths so operators can see the comparison inputs.
Comment on lines +13 to +18

## 3. Add focused test coverage

Recommended first-pass coverage:

- `syncRequestVoteTest`: grant and reject decisions for higher, lower, equal, and unavailable applied-index cases.
- `syncVotesGrantedTest` or adjacent helpers: ensure tie handling remains deterministic.
- `syncElectTest`: verify a higher-applied candidate wins once all candidates are otherwise log-eligible.

## 4. Run targeted tests

Run the built binaries from the build output directory, for example:

```sh
./build/community/source/libs/sync/test/syncRequestVoteTest
./build/community/source/libs/sync/test/syncVotesGrantedTest
./build/community/source/libs/sync/test/syncElectTest
```

## 5. Capture failover baseline and comparison

- Record a baseline failover drill before enabling the new election preference.
- Re-run the same drill after the change using the same replica topology and workload.
- Compare median leader recovery completion time and keep the before/after timestamps with the test logs.
- If using `community/tests/pytest/cluster/syncingTest.py`, record total elapsed time around the relevant replica and schema-change sequence.

## 6. Verify observability and rollout assumptions

- Trigger an election scenario with unequal applied progress.
- Confirm logs include candidate/local applied index, last-log term/index, and the grant reason.
- Confirm tie scenarios still elect a single leader without repeated deadlock.
- Confirm unavailable applied index falls back to the existing deterministic path instead of blocking votes.
- Roll out only to same-version clusters because the RequestVote wire contract has changed.

Check failure on line 51 in docs/001-raft-appliedindex-election/quickstart.md

View workflow job for this annotation

GitHub Actions / check-with-markdownlint

Files should end with a single newline character

docs/001-raft-appliedindex-election/quickstart.md:51:91 MD047/single-trailing-newline Files should end with a single newline character https://github.com/DavidAnson/markdownlint/blob/v0.38.0/doc/md047.md
37 changes: 37 additions & 0 deletions docs/001-raft-appliedindex-election/research.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Research: Raft Election with Applied Progress Priority

## Decision 1: Keep last-log-term and last-log-index as the primary safety gate

- Decision: Preserve the current `lastLogTerm` and `lastLogIndex` vote eligibility logic as the first-stage gate, and only evaluate applied progress after the candidate passes that existing Raft recency check.
- Rationale: The current implementation in `syncRequestVote.c` enforces Raft-style log freshness. Replacing that check with applied progress would risk violating election safety and diverging from established sync behavior.
- Alternatives considered: Use applied progress as the sole election criterion. Rejected because applied progress does not encode log freshness or term ordering and would weaken the existing safety guard.

## Decision 2: Extend `SyncRequestVote` with candidate applied index

- Decision: Add a candidate applied-index field to the internal `SyncRequestVote` message and populate it during `syncNodeRequestVotePeers`.
- Rationale: Remote voters cannot prefer the most up-to-date candidate unless the candidate's applied progress is transmitted with the vote request. The existing contract only carries term and log recency metadata.
- Alternatives considered: Infer candidate applied progress from heartbeats or cached peer state. Rejected because cached progress may be stale or absent exactly when an election is needed.

## Decision 3: Source applied progress from the FSM callback already exposed by sync

- Decision: Use `SSyncFSM::FpAppliedIndexCb` as the authoritative provider for local applied progress during vote request construction and vote evaluation.
- Rationale: The sync layer already invokes this callback for observability in `syncUtil.c`, which makes it the least invasive and most consistent source of applied progress.
- Alternatives considered: Introduce a new state field on `SSyncNode` or reuse `commitIndex`. Rejected because that duplicates existing data flow or fails to represent actually applied progress.

## Decision 4: Define applied progress as a deterministic preference factor, not a liveness blocker

- Decision: Use applied progress to prefer the candidate with the highest applied index when log recency is acceptable, but retain deterministic fallback behavior when applied indexes are equal or unavailable.
- Rationale: The feature request requires higher applied progress to be preferred, while the specification also requires elections to remain live under ties and degraded conditions.
- Alternatives considered: Refuse all votes for candidates with missing or lower applied index. Rejected because it could deadlock elections during partial visibility or uniform lag.

## Decision 5: Treat mixed-version message compatibility as out of scope for this feature slice

- Decision: Plan this feature as a homogeneous-version cluster change that updates both message producer and consumer together.
- Rationale: `SyncRequestVote` is a raw internal message contract with fixed-size allocation in `syncMessage.c`; adding fields changes the on-wire layout. Supporting mixed-version clusters would require a broader version-negotiation or dual-decoding design not requested here.
- Alternatives considered: Add backward-compatible dual-format decoding in the same feature. Rejected because it expands scope materially beyond election preference behavior.

## Decision 6: Validate behavior primarily with sync unit tests, plus optional cluster-style follow-up coverage

- Decision: Add or update tests in `community/source/libs/sync/test` for message contract coverage, vote-grant decision logic, and election outcomes, with cluster-style tests only if a unit harness cannot express a scenario.
- Rationale: The sync test suite already contains focused executables for request-vote, vote-manager, and election behavior, which is the most efficient place to pin regressions.
- Alternatives considered: Rely solely on manual `traft` cluster programs. Rejected because they are slower and less precise for guarding edge cases like ties and unavailable applied progress.

Check failure on line 37 in docs/001-raft-appliedindex-election/research.md

View workflow job for this annotation

GitHub Actions / check-with-markdownlint

Files should end with a single newline character

docs/001-raft-appliedindex-election/research.md:37:192 MD047/single-trailing-newline Files should end with a single newline character https://github.com/DavidAnson/markdownlint/blob/v0.38.0/doc/md047.md
Loading
Loading