This repository was archived by the owner on Feb 23, 2026. It is now read-only.
[1.3] Implement user namespaces #1
Draft
alban wants to merge 14 commits into release/1.3
…n, User Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR 64005.
With this patch, GetRuntimeConfigInfo() returns a config without user namespace: the only uid/gid mapping is the one from the root user namespace.
Before this patch, containerd created a netns, configured it with CNI,
and then created the sandbox container, giving it the path of the netns
set up previously. This means that the netns was owned by the host
userns. Mounting sysfs in the container is restricted in this setup.
This patch sets up the netns the other way around instead: it creates
the sandbox container, letting runc create a new netns. Then, it picks
the new netns from /proc/$pid/ns/net, bind mounts it at the usual CNI
path and then gives it to CNI to configure. This means that the netns is
owned by the userns of the sandbox container. In this way, mounting
sysfs is possible.
For more information about namespace ownership, see
- man ioctl_ns
- man user_namespaces, section "Interaction of user namespaces and other types of namespaces"
- Linux commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs")
torvalds/linux@7dc5dbc#diff-4839664cd0c8eab716e064323c7cd71fR1164
- net_current_may_mount() used for mounting sysfs:
ns_capable(net->user_ns, CAP_SYS_ADMIN);
https://github.com/torvalds/linux/blob/v5.7/net/core/net-sysfs.c#L1679
The sandbox container (aka "pause" container) has a tmpfs mount on /dev/shm. Bind mount it with nosuid, noexec, nodev because the mount would not be allowed in user namespaces otherwise.
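As an illustration of what such a mount entry could look like in config.json, here is a minimal sketch. The `ociMount` struct only mirrors the fields of an OCI `mounts` entry, `sandboxShmMount` and the source path are hypothetical, and the `rbind` option is an assumption; only the `nosuid`, `noexec`, `nodev` options come from the commit message above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ociMount mirrors the fields of one "mounts" entry in an OCI config.json.
type ociMount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type,omitempty"`
	Source      string   `json:"source,omitempty"`
	Options     []string `json:"options,omitempty"`
}

// sandboxShmMount builds a bind mount of the sandbox's shm directory onto
// /dev/shm with the restrictive flags described above.
// hostShmPath is a hypothetical host-side path of the sandbox's tmpfs.
func sandboxShmMount(hostShmPath string) ociMount {
	return ociMount{
		Destination: "/dev/shm",
		Type:        "bind",
		Source:      hostShmPath,
		// "rbind" is an assumption; the three flags after it are from the patch.
		Options: []string{"rbind", "nosuid", "noexec", "nodev"},
	}
}

// mountJSON renders the mount entry the way it would appear in config.json.
func mountJSON(m ociMount) (string, error) {
	b, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		return "", fmt.Errorf("encoding mount: %w", err)
	}
	return string(b), nil
}

func main() {
	out, err := mountJSON(sandboxShmMount("/path/to/sandbox/shm"))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(out)
}
```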
rata
reviewed
Jun 18, 2020
runc needs to bind mount files from /var/lib/kubelet/pods/... (such as etc-hosts) into the container. When using user namespaces, the bind mount no longer worked when containerd was started from a systemd unit. This patch fixes that by adding SupplementaryGroups=0.

runc needs to have permission on the directory to stat() the source of the bind mount. Without user namespaces, this is not a problem since runc is running as root, so it has 'rwx' permissions over the directory:

    drwxr-x---. 8 root root 4096 May 28 18:05 /var/lib/kubelet

Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount phase happens before giving up the additional permissions.

However, when using user namespaces, the runc process belongs to a different user than root (depending on the mapping). /var/lib/kubelet is seen as belonging to the special unmapped user (65534, nobody). runc no longer has 'rwx' permissions, only the empty '---' permission for 'other'. CAP_DAC_OVERRIDE is also not effective because the kernel performs the capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE). This checks that the owner of /var/lib/kubelet is mapped in the current user namespace, which is not the case.

Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was working when containerd was started manually with 'sudo' but not working when started from a systemd unit. The difference is how supplementary groups are handled between sudo and systemd units: systemd does not set supplementary groups by default.

    $ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    Groups: 0
    $ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status
    Running as unit: run-u296886.service
    Press ^] three times within 1s to disconnect TTY.
    Uid: 0 0 0 0
    Gid: 0 0 0 0
    Groups:

When runc has the supplementary group 0 configured, it is retained during the bind-mount phase, even though it is an unmapped group (runc temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc effectively has the 'r-x' permissions over /var/lib/kubelet. This makes the bind mount of etc-hosts work. After the mount phase, runc will set the credentials correctly (following OCI's config.json specification), so the container will not retain this unmapped supplementary group.

I rely on the systemd unit file being configured correctly with SupplementaryGroups=0 and I don't attempt to set it up automatically with syscall.Setgroups() because "at the kernel level, user IDs and group IDs are a per-thread attribute" (man setgroups) and the way Golang uses threads makes it difficult to predict which thread is going to be used to execute runc. glibc's setgroups() is a wrapper that changes the credentials for all threads but Golang does not use the glibc implementation.
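Since the fix depends on the unit file rather than on code, one way to notice a misconfigured unit would be to read the `Groups:` line of /proc/self/status, as the `grep` commands above do. The following is a minimal sketch of that check and is not part of this patch; `parseGroups`, `hasGroup`, and `currentGroups` are hypothetical helper names.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseGroups extracts the supplementary group IDs from the "Groups:" line
// of a /proc/<pid>/status file. An empty "Groups:" line yields an empty list.
func parseGroups(status string) ([]int, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "Groups:") {
			continue
		}
		fields := strings.Fields(strings.TrimPrefix(line, "Groups:"))
		groups := make([]int, 0, len(fields))
		for _, f := range fields {
			gid, err := strconv.Atoi(f)
			if err != nil {
				return nil, fmt.Errorf("parsing group %q: %w", f, err)
			}
			groups = append(groups, gid)
		}
		return groups, nil
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	return nil, fmt.Errorf("no Groups: line found")
}

// hasGroup reports whether gid is in the list of supplementary groups.
func hasGroup(groups []int, gid int) bool {
	for _, g := range groups {
		if g == gid {
			return true
		}
	}
	return false
}

// currentGroups reads the supplementary groups of the calling process.
func currentGroups() ([]int, error) {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		return nil, err
	}
	return parseGroups(string(data))
}

func main() {
	groups, err := currentGroups()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read supplementary groups:", err)
		return
	}
	if !hasGroup(groups, 0) {
		fmt.Println("supplementary group 0 is not set; check SupplementaryGroups=0 in the unit file")
		return
	}
	fmt.Println("supplementary group 0 is set")
}
```

Reading the status file only inspects the current state, so it avoids the per-thread problem of calling syscall.Setgroups() described above.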
Example of possible configuration:
```
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".node_wide_uid_mapping]
container_id = 0
host_id = 300000
size = 65536
[plugins."io.containerd.grpc.v1.cri".node_wide_gid_mapping]
container_id = 0
host_id = 300000
size = 65536
```
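To make the meaning of the three fields concrete, here is a small sketch of how one such mapping translates IDs. The `idMap` type and its methods are hypothetical and only illustrate the arithmetic; they are not the types used by containerd/cri.

```go
package main

import "fmt"

// idMap mirrors one mapping entry: IDs in
// [ContainerID, ContainerID+Size) map to [HostID, HostID+Size).
type idMap struct {
	ContainerID uint32
	HostID      uint32
	Size        uint32
}

// String renders the mapping in the "container host size" order used by
// /proc/<pid>/uid_map.
func (m idMap) String() string {
	return fmt.Sprintf("%d %d %d", m.ContainerID, m.HostID, m.Size)
}

// toHost translates an ID seen inside the container to the host ID.
// It returns false when the ID is outside the mapped range.
func (m idMap) toHost(id uint32) (uint32, bool) {
	if id < m.ContainerID || id-m.ContainerID >= m.Size {
		return 0, false
	}
	return m.HostID + (id - m.ContainerID), true
}

// toContainer translates a host ID to the ID seen inside the container.
// It returns false when the host ID is not mapped.
func (m idMap) toContainer(hostID uint32) (uint32, bool) {
	if hostID < m.HostID || hostID-m.HostID >= m.Size {
		return 0, false
	}
	return m.ContainerID + (hostID - m.HostID), true
}

func main() {
	// Values taken from the example configuration above.
	m := idMap{ContainerID: 0, HostID: 300000, Size: 65536}
	fmt.Println("mapping:", m)

	for _, id := range []uint32{0, 1000, 65535, 65536} {
		if host, ok := m.toHost(id); ok {
			fmt.Printf("container id %d -> host id %d\n", id, host)
		} else {
			fmt.Printf("container id %d is not mapped\n", id)
		}
	}

	// Host root (0) is outside [300000, 365536), so it is unmapped.
	if _, ok := m.toContainer(0); !ok {
		fmt.Println("host id 0 is not mapped in the container")
	}
}
```

With this configuration, root in the container (id 0) is host id 300000, and host root has no counterpart in the container, which is why /var/lib/kubelet appears owned by the unmapped user.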
Return an error if NODE_WIDE_REMAPPED is requested
Member
mauriciovasquezbernal
left a comment
Some minor comments. I think we could remove NODE_WIDE_REMAPPED everywhere in this PR, as we removed it from the kubelet one.
This is the containerd/cri implementation for the Kubernetes Node-Level User Namespaces Design Proposal.
The patches are based on the `release/1.3` branch. It is tested on Kubernetes 1.17 with patches adapted from PR 64005 (kinvolk/kubernetes#3).
The main changes are:
- `GetRuntimeConfigInfo` returning the hard coded uid mapping with 100000
- `WithRemappedSnapshot` snapshotter
- `WithoutNamespace`, `WithUserNamespace`, `WithLinuxNamespace` at the right places
At the OCI level (config.json), we have the following changes:
Demo:
Details
TODO:
- `NamespaceMode=NODE_WIDE_REMAPPED` correctly. At the moment, `NamespaceMode=POD` and `NamespaceMode=NODE` are implemented correctly.
/cc @mauriciovasquezbernal