KEP-5709: Add well-known pod network readiness gate#5995

Open
tssurya wants to merge 1 commit into kubernetes:master from tssurya:kep-5709-pod-network-readiness-gates

Conversation

@tssurya (Contributor) commented Apr 5, 2026

  • One-line PR description: Adds a well-known pod network readiness gate

Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
@k8s-ci-robot k8s-ci-robot requested a review from danwinship April 5, 2026 11:45
@tssurya (Contributor, Author) commented Apr 5, 2026

/sig network

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tssurya
Once this PR has been reviewed and has the lgtm label, please assign mikezappa87 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Apr 5, 2026
@k8s-ci-robot k8s-ci-robot requested a review from MikeZappa87 April 5, 2026 11:45
@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 5, 2026
status: implementable
creation-date: 2026-04-05
reviewers:
- "@danwinship"
Contributor (Author):

Who else should I add as reviewers and approvers, and how do I get these names?

Member:

I'm happy to help with a review. I'm unsure if there are requirements for a reviewer though.

Contributor (Author):

Thank you, Adrian! I'll add you to the reviewers list as well.


# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
feature-gates:
Contributor (Author):

I think we don't need a feature gate but I can't tell...

Contributor:

You need a feature gate if you're modifying any core components (eg, kubelet), but not if the changes are all external to k/k

@k8s-ci-robot (Contributor):

@tssurya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-enhancements-test 7bc66af link true /test pull-enhancements-test
pull-enhancements-verify 7bc66af link true /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


<<[/UNRESOLVED]>>

### Approach A: Network plugin webhook (no core changes)
Contributor (Author):

I think I'm going to get huge pushback on approaches B and C :D but I wanted to put them up here first for review, then move them into Alternatives.
Some of the later sections are intentionally not filled in yet, to allow time for convergence on the approach.

@tssurya (Contributor, Author) commented Apr 5, 2026

cc @fasaxc @caseydavenport @joestringer PTAL since this is something that might be of interest to the network plugins


Comment on lines +364 to +366
- **Con:** If the webhook is unavailable and `failurePolicy` is
`Ignore`, pods are created without the gate and silently lose
protection.
Member:

Is there another con when failurePolicy is set to Fail, which could prevent pods from being created at all?
(I don't know if this is implementation-specific; I don't have much CNI experience. But is it possible for the webhook to match only pods that require the CNI plugin? I assume this may be out of scope for the KEP.)

Contributor (Author):

> Is there another con for when the failurePolicy is set to Fail, which may cause pods to be unable to be created?

Yeah, that's totally possible as well, and it has a bigger impact. I figured that since people are opting into webhooks, it's something they live with, but I can call this aspect out too. Thanks for asking.

> (I don't know if this is implementation specific, I don't have much CNI experience, but can it be possible for only Pods that require the CNI plugin to be matched by the webhook? I assume this may be out of scope of the KEP.)

This is a good question. I haven't implemented a webhook myself, but after investigating a bit more, it sounds like the webhook API can't differentiate at that level; the closest we can get is namespaceSelector filtering. So the webhook would receive all pod CREATE events, and inside the handler I'd need logic to check spec.hostNetwork and skip those pods...
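For illustration, here is a minimal sketch of what such a webhook registration could look like. All names below (the configuration name, webhook name, service, namespace, path) are hypothetical and not from the KEP; the point is that the API can only narrow matching by namespace or object labels, so hostNetwork filtering has to live in the handler:

```yaml
# Hypothetical sketch of a gate-injecting webhook registration.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: pod-network-readiness-gate-injector   # hypothetical name
webhooks:
- name: inject-gate.example.com               # hypothetical name
  failurePolicy: Ignore   # "Fail" would block all pod creation if the webhook is down
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  # Matching can only be narrowed by namespace/object labels, not by pod
  # spec fields; hostNetwork pods must be skipped inside the handler.
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system"]
  clientConfig:
    service:
      name: gate-injector        # hypothetical
      namespace: network-system  # hypothetical
      path: /mutate
  admissionReviewVersions: ["v1"]
  sideEffects: None
```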

Contributor:

failurePolicy would basically have to be set to Ignore here. Fail is just way too fragile.

Member:

Aren't both options too fragile?
You either get your pod without the CNI (which I assume is an undesired state) or you don't get the pod at all (also undesired, but may be a better failure mode)

But also, if we're letting CNIs handle this webhook, they could do whatever they want when they register the webhook, so I assume we should document both modes as a "Con".

@danwinship (Contributor) left a comment:

Didn't get all the way to the end, but there's already plenty to think about...

- "@tssurya"
owning-sig: sig-network
participating-sigs:
- sig-network
Contributor:

you don't need to list sig-network as both "owning" and "participating"



Comment on lines +175 to +176
Kubernetes currently has no explicit signal for whether a pod's
network has been fully programmed and is ready to receive traffic.
Contributor:

Suggested change:
- Kubernetes currently has no explicit signal for whether a pod's network has been fully programmed and is ready to receive traffic.
+ Kubernetes currently has no explicit signal for whether a pod has been fully attached to the pod network and is ready to receive traffic.

The closest existing condition, [`PodReadyToStartContainers`][KEP-3085],
indicates that the pod sandbox has been created and CNI `ADD` has
returned — but not that the network datapath is fully programmed.
This KEP introduces a built-in [pod readiness gate][KEP-580]
Contributor:

Suggested change:
- This KEP introduces a built-in [pod readiness gate][KEP-580]
+ This KEP introduces a well-known [pod readiness gate][KEP-580]

It's not built-in to Kubernetes, it's just a standard thing for pod network implementations to implement.
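To make the "well-known, not built-in" distinction concrete, here is a sketch of how such a gate would appear on a pod. The condition name `network.kubernetes.io/pod-network-ready` is hypothetical (the excerpt above does not fix a name); the mechanism is the standard readiness-gate one from KEP-580:

```yaml
# Hypothetical condition name; set by the network plugin, not by kubelet.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  readinessGates:
  - conditionType: network.kubernetes.io/pod-network-ready  # hypothetical
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
status:
  conditions:
  # The pod only becomes Ready once every readiness-gate condition is True;
  # the network plugin patches this condition when the datapath is programmed.
  - type: network.kubernetes.io/pod-network-ready
    status: "True"
```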

condition that the network plugin sets to indicate network readiness,
cleanly separating application readiness (answered by readiness probes) from
network readiness (answered by the network plugin). This becomes
especially important as [KEP-4559] moves kubelet probes to run
Contributor:

Suggested change:
- especially important as [KEP-4559] moves kubelet probes to run
+ especially important as [KEP-4559] proposes to move kubelet probes to run

since it's still not even provisional yet...

outside the cluster (e.g., Ingress or cloud load balancers), as those
are separate concerns with their own readiness signals.

### User Stories
Contributor:

User stories are optional and can be omitted if they don't actually tell the reader anything new. (ie, don't just make up user stories to fill in the template, if you've already fully explained the problem to the extent that we understand it in the rest of the KEP)


A compliance team requires that no traffic reach a pod before its
NetworkPolicy rules are fully programmed. Today, there is a
[documented race][np-pod-lifecycle] where a pod can receive traffic
Contributor:

No no no, the docs you're pointing to explicitly forbid implementations from having that race condition. You can mark the pod ready when some traffic is denied that should have been accepted, but you can't mark it ready when some traffic is accepted that should have been denied. This KEP should not change that (because requiring that all accept rules are fully programmed might drastically affect startup latency.)

cluster network interface plus a high-speed RDMA interface. Each
device may be programmed by a different plugin or driver, and each
has its own readiness timeline. The pod should not receive traffic
until all its network devices are fully plumbed. The network plugin
Contributor:

I don't think that's correct. The network readiness condition is just about "can the endpoint be reached by Services". As of right now, even when using DRA and multiple networks, Services are always reached over the cluster-default pod network, so that's what the network readiness condition should be checking.

If the code running within the pod needs access to secondary networks to do its job, then that's an application-level readiness issue, not a network readiness issue. (Even if the secondary network is attached, there's no guarantee that the remote database on that secondary network is actually up and running anyway; you would want to have your application-level readiness probe test that, and then in that case, there is no need to explicitly consider secondary-network-reachability.)

How any of this would interact with secondary networks in a future multi-network k8s environment depends on the multi-network networking model...

  1. If Services always point to endpoints on the cluster-default pod network, then the network readiness API doesn't need to consider other networks.
  2. If Services can exist on multiple networks, but any given Pod can only be an endpoint of Services on a single network, then we would want the Pod's Readiness to take into account its reachability only on that single network.
  3. If Services can exist on multiple networks, and a given Pod may be an endpoint of Services on multiple networks, then probably a Pod's overall Readiness should not be tied to its reachability on any particular network, and we just should keep the signal separate from Pod Readiness, and have the service proxy start tracking both Readiness and reachability separately, so that future multi-network service proxies can correctly distinguish things like "Pod A is not ready; Pod B is ready and reachable over Network X but not over Network Y; Pod C is ready and reachable over both Network X and Network Y."
    • (though we could simplify and say that multi-network Pods can only be endpoints of Services when they are reachable on all of the networks they are attached to).

convention described in [KEP-580]. Trade-off: there is no single
condition name that operators and tooling can rely on across
clusters, and Approach B (kubelet hardcodes the condition) would
not work with per-plugin names.
Contributor:

Also, if there's no standard name, then you can't know for sure if the feature is being used in a given cluster (ie, if it's guaranteed that your pods won't become ready until the network is plumbed).

Gives operators and tooling a consistent name to query across any
cluster while still following the [KEP-580] naming convention.
Trade-off: in multi-plugin clusters only one plugin can own the
condition, or the plugins must coordinate who sets it.
Contributor:

While I'm not worried about the multi-network case, there is still the problem of clusters where the "pod network implementation" consists of multiple unrelated pieces. For example, if you're using flannel plus kube-network-policies, then both components affect whether the pod is fully reachable, but they don't coordinate with each other enough to be able to do a single condition... hm...

@fasaxc commented Apr 8, 2026

Thanks for opening this discussion. However, I'm not sure a readiness gate is enough; we've had a pretty strong signal from our users that they want the network to be ready before their process starts inside the pod. A readiness gate works for incoming service traffic, but it does nothing to delay start-up of the user's app inside the pod. Calico's original design split the CNI plugin and network policy parts, so that the CNI plugin would return as soon as IPAM was done and the veth created. Our policy is arranged so that new pods get no connectivity until the daemonset kicks in and applies the pod-specific rules (be that iptables/nftables/BPF). While we generally "win" that race, we can lose in a large cluster when an app starts quickly and immediately starts making outgoing connections. Many apps are written to fail if their first few requests fail, or if DNS is not accessible immediately.

Overall, I'd much rather have a solution that delays container execution inside the pod until we set some flag. Calico now has a mode where the CNI plugin will wait for up to N seconds for the policy to be programmed before continuing. This closes the gap but it might be surprising to CRI/Kubelet if the CNI plugin takes longer than expected.

@danwinship (Contributor):

Yeah, that's another thing we could consider. I know in ovn-kubernetes, we intentionally return from the CNI plugin "early", because, IIRC, kubelet is basically blocked from starting up another pod until sandbox creation completes, so if every CNI ADD call waits for the pod to be fully networked, it massively slows down the rate at which you can create new pods. Maybe we should fix that instead (since you're right, people really don't want their pods to start up with half-working networking...)
