KEP-5709: Add well-known pod network readiness gate #5995
tssurya wants to merge 1 commit into kubernetes:master
Conversation
tssurya
commented
Apr 5, 2026
- One-line PR description: Adds a well-known pod network readiness gate
- Issue link: Pod readiness gate for network readiness #5709
- Other comments: Stems from KEP-4559 Redesigning Kubelet probes #4558 (comment)
Signed-off-by: Surya Seetharaman <suryaseetharaman.9@gmail.com>
/sig network
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tssurya. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
> status: implementable
> creation-date: 2026-04-05
> reviewers:
>   - "@danwinship"
Who else should I add as reviewers and approvers, and how do I get these names?
I'm happy to help with a review. I'm unsure if there are requirements for a reviewer though.
thank you Adrian! I'll add you as well to the reviewers list
> # The following PRR answers are required at alpha release
> # List the feature gate name and the components for which it must be enabled
> feature-gates:
I think we don't need a feature gate but I can't tell...
You need a feature gate if you're modifying any core components (eg, kubelet), but not if the changes are all external to k/k
@tssurya: The following tests failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
> <<[/UNRESOLVED]>>
> ### Approach A: Network plugin webhook (no core changes)
I think I'm going to get huge pushback for approaches B and C :D but wanted to put them up here first for review and then move them into alternatives. Some of the later sections are not filled in just to allow some time for convergence on the approach.
cc @fasaxc @caseydavenport @joestringer PTAL since this is something that might be of interest to the network plugins
> - **Con:** If the webhook is unavailable and `failurePolicy` is
>   `Ignore`, pods are created without the gate and silently lose
>   protection.
Is there another con for when the `failurePolicy` is set to `Fail`, which may cause pods to be unable to be created?
(I don't know if this is implementation specific, I don't have much CNI experience, but can it be possible for only Pods that require the CNI plugin to be matched by the webhook? I assume this may be out of scope of the KEP.)
> Is there another con for when the `failurePolicy` is set to `Fail`, which may cause pods to be unable to be created?
Yea, that's totally possible as well, and it has bigger impact, but I thought since people are opting into webhooks that's something they live with. I can also call this aspect out. Thanks for asking this.
> (I don't know if this is implementation specific, I don't have much CNI experience, but can it be possible for only Pods that require the CNI plugin to be matched by the webhook? I assume this may be out of scope of the KEP.)
This is a good question. I haven't implemented a webhook myself, but on investigating a bit more, it sounds like it can't differentiate; the closest we can get is `namespaceSelector` filtering. So the webhook would get all CREATE pod events, but inside the webhook handler I'd need to code the logic to check `spec.hostNetwork` and skip those pods...
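To make that concrete, the handler-side skip logic could look roughly like this. This is only a sketch with simplified stand-in types (a real webhook would decode the AdmissionReview into `k8s.io/api/core/v1` types), and the condition name is a placeholder I made up, not one this KEP has settled on:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified stand-ins for the Pod fields the webhook cares about.
type ReadinessGate struct {
	ConditionType string `json:"conditionType"`
}

type PodSpec struct {
	HostNetwork    bool            `json:"hostNetwork,omitempty"`
	ReadinessGates []ReadinessGate `json:"readinessGates,omitempty"`
}

// Placeholder condition name, for illustration only.
const netReadyCondition = "example.k8s.io/PodNetworkReady"

// readinessGatePatch returns the JSON Patch operations the mutating webhook
// would emit for a CREATE pod event, or nil when no mutation is needed:
// hostNetwork pods never go through the CNI plugin, so they are skipped.
func readinessGatePatch(spec PodSpec) []map[string]any {
	if spec.HostNetwork {
		return nil
	}
	for _, g := range spec.ReadinessGates {
		if g.ConditionType == netReadyCondition {
			return nil // gate already present; nothing to do
		}
	}
	if len(spec.ReadinessGates) == 0 {
		// Field absent: create the whole list.
		return []map[string]any{{
			"op":    "add",
			"path":  "/spec/readinessGates",
			"value": []ReadinessGate{{ConditionType: netReadyCondition}},
		}}
	}
	// Field present: append to it.
	return []map[string]any{{
		"op":    "add",
		"path":  "/spec/readinessGates/-",
		"value": ReadinessGate{ConditionType: netReadyCondition},
	}}
}

func main() {
	patch, _ := json.Marshal(readinessGatePatch(PodSpec{}))
	fmt.Println(string(patch))
	fmt.Println(readinessGatePatch(PodSpec{HostNetwork: true}) == nil)
}
```

Note this only hides the hostNetwork case from the gate; it doesn't solve the `failurePolicy` fragility discussed above, since the webhook still has to be reachable to inject the gate at all.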
`failurePolicy` would basically have to be set to `Ignore` here. `Fail` is just way too fragile.
Aren't both options too fragile?
You either get your pod without the CNI (which I assume is an undesired state) or you don't get the pod at all (also undesired, but may be a better failure mode)
But also, if we're letting CNIs handle this webhook, they could do whatever they want when they register the webhook, so I assume we should document both modes as a "Con".
danwinship left a comment:
Didn't get all the way to the end, but there's already plenty to think about...
>   - "@tssurya"
> owning-sig: sig-network
> participating-sigs:
>   - sig-network
you don't need to list sig-network as both "owning" and "participating"
> Kubernetes currently has no explicit signal for whether a pod's
> network has been fully programmed and is ready to receive traffic.
Suggested change:
> Kubernetes currently has no explicit signal for whether a pod
> has been fully attached to the pod network and is ready to receive traffic.
> The closest existing condition, [`PodReadyToStartContainers`][KEP-3085],
> indicates that the pod sandbox has been created and CNI `ADD` has
> returned — but not that the network datapath is fully programmed.
> This KEP introduces a built-in [pod readiness gate][KEP-580]
Suggested change:
> This KEP introduces a well-known [pod readiness gate][KEP-580]
It's not built-in to Kubernetes, it's just a standard thing for pod network implementations to implement.
> condition that the network plugin sets to indicate network readiness,
> cleanly separating application readiness (answered by readiness probes) from
> network readiness (answered by the network plugin). This becomes
> especially important as [KEP-4559] moves kubelet probes to run
Suggested change:
> especially important as [KEP-4559] proposes to move kubelet probes to run
since it's still not even provisional yet...
> outside the cluster (e.g., Ingress or cloud load balancers), as those
> are separate concerns with their own readiness signals.
> ### User Stories
User stories are optional and can be omitted if they don't actually tell the reader anything new. (ie, don't just make up user stories to fill in the template, if you've already fully explained the problem to the extent that we understand it in the rest of the KEP)
> A compliance team requires that no traffic reach a pod before its
> NetworkPolicy rules are fully programmed. Today, there is a
> [documented race][np-pod-lifecycle] where a pod can receive traffic
No no no, the docs you're pointing to explicitly forbid implementations from having that race condition. You can mark the pod ready when some traffic is denied that should have been accepted, but you can't mark it ready when some traffic is accepted that should have been denied. This KEP should not change that (because requiring that all accept rules are fully programmed might drastically affect startup latency.)
> cluster network interface plus a high-speed RDMA interface. Each
> device may be programmed by a different plugin or driver, and each
> has its own readiness timeline. The pod should not receive traffic
> until all its network devices are fully plumbed. The network plugin
I don't think that's correct. The network readiness condition is just about "can the endpoint be reached by Services". As of right now, even when using DRA and multiple networks, Services are always reached over the cluster-default pod network, so that's what the network readiness condition should be checking.
If the code running within the pod needs access to secondary networks to do its job, then that's an application-level readiness issue, not a network readiness issue. (Even if the secondary network is attached, there's no guarantee that the remote database on that secondary network is actually up and running anyway; you would want to have your application-level readiness probe test that, and then in that case, there is no need to explicitly consider secondary-network-reachability.)
How any of this would interact with secondary networks in a future multi-network k8s environment depends on the multi-network networking model...
- If Services always point to endpoints on the cluster-default pod network, then the network readiness API doesn't need to consider other networks.
- If Services can exist on multiple networks, but any given Pod can only be an endpoint of Services on a single network, then we would want the Pod's Readiness to take into account its reachability only on that single network.
- If Services can exist on multiple networks, and a given Pod may be an endpoint of Services on multiple networks, then probably a Pod's overall Readiness should not be tied to its reachability on any particular network, and we just should keep the signal separate from Pod Readiness, and have the service proxy start tracking both Readiness and reachability separately, so that future multi-network service proxies can correctly distinguish things like "Pod A is not ready; Pod B is ready and reachable over Network X but not over Network Y; Pod C is ready and reachable over both Network X and Network Y."
- (though we could simplify and say that multi-network Pods can only be endpoints of Services when they are reachable on all of the networks they are attached to).
> convention described in [KEP-580]. Trade-off: there is no single
> condition name that operators and tooling can rely on across
> clusters, and Approach B (kubelet hardcodes the condition) would
> not work with per-plugin names.
Also, if there's no standard name, then you can't know for sure if the feature is being used in a given cluster (ie, if it's guaranteed that your pods won't become ready until the network is plumbed).
> Gives operators and tooling a consistent name to query across any
> cluster while still following the [KEP-580] naming convention.
> Trade-off: in multi-plugin clusters only one plugin can own the
> condition, or the plugins must coordinate who sets it.
While I'm not worried about the multi-network case, there is still the problem of clusters where the "pod network implementation" consists of multiple unrelated pieces. For example, if you're using flannel plus kube-network-policies, then both components affect whether the pod is fully reachable, but they don't coordinate with each other enough to be able to do a single condition... hm...
Thanks for opening this discussion. However, I'm not sure a readiness gate is enough; we've had a pretty strong signal from our users that they want the network to be ready before their process starts inside the pod. A readiness gate works for incoming service traffic, but it does nothing to delay start-up of the user's app inside the pod.

Calico's original design split the CNI plugin and network policy parts, so that the CNI plugin would return as soon as the IPAM was done and veth created. Our policy is arranged so that new pods get no connectivity until the daemonset kicks in and applies the pod-specific rules (be that iptables/nftables/BPF). While we generally "win" that race, we can lose in a large cluster when an app starts quickly and immediately starts making outgoing connections. Many apps are written to fail if their first few requests fail, or if DNS is not accessible immediately.

Overall, I'd much rather have a solution that delays container execution inside the pod until we set some flag. Calico now has a mode where the CNI plugin will wait for up to N seconds for the policy to be programmed before continuing. This closes the gap but it might be surprising to CRI/Kubelet if the CNI plugin takes longer than expected.
Yeah, that's another thing we could consider. I know in ovn-kubernetes, we intentionally return from the CNI plugin "early", because, IIRC, kubelet is basically blocked from starting up another pod until the sandbox creation completes, so if every CNI ADD call waits for the pod to be fully networked, it massively slows down the rate at which you can create new pods. Maybe we should fix that instead (since you're right, people really don't want their pods to start up with half-working networking...)