User ns reworked#4
User ns reworked#4mauriciovasquezbernal wants to merge 10 commits intorata/sidecar-ordering-annotations-1.17from
Conversation
a0ab60f to
b18cd2d
Compare
rata
left a comment
There was a problem hiding this comment.
I did some silly comments.
I'll need to review with more time the changes for the kuberuntime. Is tomorrow okay? :)
41c3fe9 to
8992dbc
Compare
GetRuntimeConfigInfo returns information about the configuration of the runtime. For now it only returns the uid/gid mappings configuration.
Implement a new helper that will be used in the following commits
Extend the CRI to include a new user namespace mode field to allow kubelet to specify what is the prefered user namespace mode for a given pod.
This commit implements support for user ns in the kubelet. The kubelet uses the GetRuntimeInfoConfig function of the runtime to query for the uid/gid configured mapping. Kubelet tries to use POD mode for the user namespace when possible, NODE is used when: - Feature is not supported nor enabled in the runtime - The value of the "alpha.kinvolk.io/userns" annotation is "node" - The pod specification is imcompatible with it -- Any host namespace is used (IPC, PID, NET) -- There is any host-path volume -- There is any non namespaced capability (MKNOD, SYS_TIME, SYS_MODULE) -- There is any privileged container -- The pod has PVC mounts Files under the pod volumes dir (/var/lib/kubelet/pods/xxxx/volumes) are chowned to the mapped user in the host if the user namespace is used.
f223792 to
31640cb
Compare
rata
left a comment
There was a problem hiding this comment.
@mauriciovasquezbernal thanks a lot for the patch, looks very good!
Added some silly questions for some things I didn't follow and wanted to confirm if we want that behavior, but overall LGTM! :)
| return false | ||
| } | ||
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
| c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
nit picking: is the uint32() needed? I think we shouldn't need it. Or am I missing something?
I mean, golang does force to use some explicit conversions, but as explained here I don't think it should be needed to compare a var with a constant.
Am I missing something?
There was a problem hiding this comment.
You're right. I'll remove them.
| return false | ||
| } | ||
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
| c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
why c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295)? I mean, why equal to that constant value? It is set to that when the runtime doesn't support user ns?
There was a problem hiding this comment.
4294967295 encapsulates the entire uid/gid range I believe. The initial root namespace has a uid_map of 0 0 4294967295. But I agree that the use of it here is unclear
There was a problem hiding this comment.
Yes, the idea is that if the configured mapping is 0 0 4294967295 it means that the support is not enabled (all the ids in the container are mapped to the same ids in the host).
There was a problem hiding this comment.
Ohh, I see. Thanks! Maybe a comment can make it very clear to everyone :)
| return false | ||
| } | ||
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
| c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(0) { |
| single := &RuntimeConfigInfo{ | ||
| UserNamespaceConfig: UserNamespaceConfigInfo{ | ||
| UidMappings: []*UserNSMapping{ | ||
| &UserNSMapping{ |
There was a problem hiding this comment.
Maybe add a test with more than one mapping too, given that this is a slice?
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
| c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(0) { |
There was a problem hiding this comment.
Why if len(...) == 1 and hostID == 0 and size == 0? Not sure I follow what you are trying to detect.
Is it just when the size is 0? Why does it matter, in that case, that the hostID is 0 too? Or why wouldn't it be a problem if the specified mappings are more than one?
There was a problem hiding this comment.
It's trying to detect the case where a mapping like 0 1000 0 is configured. You're totally right, the whole logic to check if the usernamespace is supported or enabled has to be checked again. I think it could be simplified a lot but I avoided touching that at this time.
| gidMapping.Size_ = gidMappingSize | ||
| } else { | ||
| uidMapping.Size_ = uint32(4294967295) | ||
| gidMapping.Size_ = uint32(4294967295) |
There was a problem hiding this comment.
What is 4294967295? The max value for something? Is it constant across platforms?
| // container_id is the starting id for the mapping inside the container. | ||
| uint32 container_id = 1; | ||
| // host_id is the starting id for the mapping on the host. | ||
| uint32 host_id = 2; |
There was a problem hiding this comment.
Why use uint32 for container and host UIDs? Is it because in all platforms will be 32 or less?
I think we talked about this in the past, about being maybe an unsigned int in C, but can't find that thread now. Sorry to re-ask :-/
| &kubecontainer.RuntimeConfigInfo{}, | ||
| false, | ||
| }, | ||
| "Single mapping": { |
There was a problem hiding this comment.
Idem, will be nice to test more than once mapping, just in case
|
|
||
| cadvisorapi "github.com/google/cadvisor/info/v1" | ||
| "google.golang.org/grpc/codes" | ||
| "google.golang.org/grpc/status" |
There was a problem hiding this comment.
status and codes can be too generic and, in the future, clash with other imports. Maybe should we use a name that makes it clear that is a grpc specific thing?
Not now, but in the future. Ah, and only if this codes continues to live here :)
cdesiniotis
left a comment
There was a problem hiding this comment.
Hi! Overall this is really good work. I left a few comments/questions.
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
| c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
I think this would be more clear.
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | |
| c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { | |
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | |
| c.UserNamespaceConfig.UidMappings[0].HostID == c.UserNamespaceConfig.UidMappings[0].ContainerID { |
| return false | ||
| } | ||
| if len(c.UserNamespaceConfig.UidMappings) == 1 && | ||
| c.UserNamespaceConfig.UidMappings[0].HostID == uint32(0) && c.UserNamespaceConfig.UidMappings[0].Size == uint32(4294967295) { |
There was a problem hiding this comment.
4294967295 encapsulates the entire uid/gid range I believe. The initial root namespace has a uid_map of 0 0 4294967295. But I agree that the use of it here is unclear
|
|
||
| // getRemappedNonRootHostID parses docker info to determine ID on the host usernamespace which is mapped to {U/G}ID 0 in the container user-namespace | ||
| func getRemappedNonRootHostID(dockerInfo *dockertypes.Info) (uint32, error) { | ||
| remappedNonRootHostID64, err := strconv.ParseUint(strings.Split(path.Base(dockerInfo.DockerRootDir), ".")[0], 10, 0) |
There was a problem hiding this comment.
The output of path.Base(. . .) is /var/lib/docker if userns is not enabled, and /var/lib/docker/1000.1000 if userns is enabled (replace 1000 with whatever uid remapping is configured). I agree this is a little hard to review, but it works :)
| // User namespace for this container/sandbox. | ||
| // Note: There is currently no way to set CONTAINER scoped user namespace in the Kubernetes API. | ||
| // NODE is the default value. Kubelet will set it to POD if pod spec indicates to use user-namespace remapping | ||
| // Namespaces currently set by the kubelet: POD, NODE |
There was a problem hiding this comment.
Does POD mean that all pods will be remapped to the same namespace (assuming userns are supported/enabled at runtime)? If so, I think the term POD is unclear as it suggests that each pod gets mapped to its own namespace. What happened to NODE_WIDE_REMAPPED?
There was a problem hiding this comment.
Pod means each pod will have it's own namespace. The mapping, at least in this PoC, is the same for all pods, though.
NODE_WIDE_REMAPPED is not implemented here. We find out it was easier to implement a per-pod user ns, that also seems better. IIUC, NODE_WIDE_REMAPPED was proposed years ago before most container runtime added support for user ns and as a way to run containers as root but not be root on the host. We can implement that if there is still value, though :)
There was a problem hiding this comment.
Thanks for the clarification. I don't think NODE_WIDE_REMAPPED is necessary. But maybe a comment here clarifying what POD means (and what it will mean in the future) would be helpful.
| makeIPTablesUtilChains: kubeCfg.MakeIPTablesUtilChains, | ||
| iptablesMasqueradeBit: int(kubeCfg.IPTablesMasqueradeBit), | ||
| iptablesDropBit: int(kubeCfg.IPTablesDropBit), | ||
| experimentalHostUserNamespaceDefaulting: utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalHostUserNamespaceDefaultingGate), |
There was a problem hiding this comment.
Is there a reason why this entire sections shows as a diff? I believe you are just removing the experimentalHostUserNamespaceDefaulting feature
| managedHostsHeaderWithHostNetwork = "# Kubernetes-managed hosts file (host network).\n" | ||
|
|
||
| // Kinvolk alpha annotation for user namespaces | ||
| kivolkUsernsAnn = "alpha.kinvolk.io/userns" |
There was a problem hiding this comment.
Is using an annotation a temporary fix? What's the reasoning behind not updating the pod spec?
There was a problem hiding this comment.
We are using an annotation to easily test on real clusters without changing the pod.Spec. Upstream PR will change the pod spec.
| } | ||
| return false | ||
| } | ||
|
|
There was a problem hiding this comment.
nit: add comment for UserNamespaceForPod() for consistency
thanks for the review and your answers, really helpful! :) |
|
Is this expected to work with dockershim at the moment? |
|
Yes, it should work when using the |
|
Hmm okay. I get permission issues when kubelet attempts to mount from |
|
@cdesiniotis I have the following: Can you share the exact error you're getting? |
|
@mauriciovasquezbernal My setup: I run some pods. When I specify to use the pod userns it fails: Looking in the kubelet logs at |
|
@cdesiniotis oh right. I suppose you are running containerd with systemd, if that's the case it needs to have the supplemental group 0 ( |
|
@mauriciovasquezbernal Ah that was it. Thank you for the help! |
| ) | ||
|
|
||
| var ( | ||
| linuxIDMappingRegexp = regexp.MustCompile("([aA-zZ]+):([0-9]+):([0-9]+)") |
There was a problem hiding this comment.
It seems that the first field can be numbers too. We probably need to modify this. Or maybe just use Split() with : as separator? Might be simpler, not sure
From https://docs.docker.com/engine/security/userns-remap/#prerequisites:
Each file contains three fields: the username or ID of the user, followed by a beginning UID or GID (which is treated as UID or GID 0 within the namespace) and a maximum number of UIDs or GIDs available to the user
There was a problem hiding this comment.
Additionally, a typical accepted regex for usernames in Linux is "^[a-z][-a-z0-9_]*\$", so need to account for numbers in the first field anyways.
|
@rata @cdesiniotis I'm checking the state of the art on the proposals for this feature first. I'll address your comments once I have a clear view of what are the next steps on this implementation. Thanks a lot for the review. |
|
@mauriciovasquezbernal Thanks for your work! I am willing to help/review, so feel free to include me for review on this (or related work). Also, what are the current plans for upstream work (i.e. KEP and upstream PR)? Are you (@rata @mauriciovasquezbernal) still planning to lead that effort? |
Add two examples with SYS_ADMIN and NET_ADMIN capabilities.
f4ddac6 to
6e902e7
Compare
- Differenciate when it's a limitation on the Pod spec or in the runtime - Don't drop the content of the error message shown to the user
6e902e7 to
b14dc20
Compare
There are some features that *could* be not compatible with user namespaces: - hostpath volumes - PVC - sharing other host namespaces - some capabilities The current code forbids to use 'Pod' mode when one of those features is present on the PodSpec. This commit relaxes that logic and replaces it by a warning only for the case where host namespaces are shared given that it's the only case that could present issues when creating the container.
It's a reworked version of #3.