[1.3] Implement user namespaces by alban · Pull Request #1 · kinvolk/containerd-cri

alban · 2020-06-18T11:12:50Z

This is the containerd/cri implementation for the Kubernetes Node-Level User Namespaces Design Proposal.

The patches are based on the release/1.3 branch. It is tested on Kubernetes 1.17 with patches adapted from PR 64005 (kinvolk/kubernetes#3).

The main changes are:

Import the new CRI API from Kubernetes PR 64005.
Implement GetRuntimeConfigInfo returning the hard coded uid mapping with 100000
Use the WithRemappedSnapshot snapshotter
Fix sysfs mount with correct ownership of netns (see commitmsg for details)
Additional mount restrictions on /dev/shm (nosuid, noexec, nodev)
Chown /dev/shm appropriately
Fix etc-hosts mounts with supplementary groups (see commitmsg for details)
Use custom options WithoutNamespace, WithUserNamespace, WithLinuxNamespace at the right places

At the OCI level (config.json), we have the following changes:

sandbox container with new "user" namespace and UidMappings
normal containers with "user" namespace from a path and UidMappings

Demo:

$ kubectl apply -f userns-tests/pod-simple.yaml 
pod/pod-simple created
$ kubectl apply -f userns-tests/pod-userns.yaml
pod/pod-userns created
$ kubectl exec -ti pod-simple -- /bin/sh -c 'cat /proc/self/uid_map'
         0          0 4294967295
$ kubectl exec -ti pod-userns -- /bin/sh -c 'cat /proc/self/uid_map'
         0     100000      65535

Details

userns-tests/pod-simple.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pod-simple
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "disabled"
spec:
  containers:
  - name: pod-simple
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]

userns-tests/pod-userns.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: pod-userns
  namespace: default
  annotations:
    alpha.kinvolk.io/userns: "enabled"
spec:
  containers:
  - name: pod-userns
    image: busybox
    command: ["sh"]
    args: ["-c", "sleep infinity"]

TODO:

Make the uid mapping configurable instead of hard coding 100000.
Try on the master branch (presumably upstream wants an implementation on master first)
Tests.
Implement NamespaceMode=NODE_WIDE_REMAPPED correctly. At the moment, NamespaceMode=POD and NamespaceMode=NODE are implemented correctly.
Check the status of volumes.

/cc @mauriciovasquezbernal

…n, User Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR 64005.

With this patch, GetRuntimeConfigInfo() returns a config without user namespace: the only uid/gid mapping is the one from the root user namespace.

Before this patch, containerd created a netns, configured it with CNI, and then creates the sandbox container by giving the netns path previously setup. This means that the netns was owned by the host userns. Mounting sysfs in the container is restricted in this setup. This patch sets up the netns in the other way around instead: it creates the sandbox container, letting runc create a new netns. Then, it picks the new netns from /proc/$pid/ns/net, binds mount it in the usual CNI path and then gives it to CNI to configure. This means that the netns is owned by the userns of the sandbox container. In this way, mounting sysfs is possible. For more information about namespace ownership, see - man ioctl_ns - man user_namespaces, section "Interaction of user namespaces and other types of namespaces" - Linux commit 7dc5dbc879bd ("sysfs: Restrict mounting sysfs") torvalds/linux@7dc5dbc#diff-4839664cd0c8eab716e064323c7cd71fR1164 - net_current_may_mount() used for mounting sysfs: ns_capable(net->user_ns, CAP_SYS_ADMIN); https://github.com/torvalds/linux/blob/v5.7/net/core/net-sysfs.c#L1679

The sandbox container (aka "pause" container) has a tmpfs mount on /dev/shm. Bind mount it with nosuid, noexec, nodev because the mount would not be allowed in user namespaces otherwise.

runc needs to bind mount files from /var/lib/kubelet/pods/... (such as etc-hosts) into the container. When using user namespaces, the bind mount didn't work anymore when containerd is started from a systemd unit. This patch fixes that by adding SupplementaryGroups=0 runc needs to have permission on the directory to stat() the source of the bind mount. Without user namespaces, this is not a problem since runc is running as root, so it has 'rwx' permissions over the directory: drwxr-x---. 8 root root 4096 May 28 18:05 /var/lib/kubelet Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount phase happens before giving up the additional permissions. However, when using user namespaces, the runc process is belonging to a different user than root (depending on the mapping). /var/lib/kubelet is seen as belonging to the special unmapped user (65534, nobody). runc does not have 'rwx' permissions anymore but the empty '---' permission for 'other'. CAP_DAC_OVERRIDE is also no effective because the kernel performs the capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE). This checks that the owner of the /var/lib/kubelet is mapped in the current user namespace, which is not the case. Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts was working when containerd was started manually with 'sudo' but not working when started from a systemd unit. The difference is how supplementary groups are handled between sudo and systemd units: systemd does not set supplementary groups by default. $ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: 0 $ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status Running as unit: run-u296886.service Press ^] three times within 1s to disconnect TTY. Uid: 0 0 0 0 Gid: 0 0 0 0 Groups: When runc has the supplementary group 0 configured, it is retained during the bind-mount phase, even though it is an unmapped group (runc temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc effectively has the 'r-x' permissions over /var/lib/kubelet. This makes the bind mount of etc-hosts work. After the mount phase, runc will set the credential correctly (following OCI's config.json specification), so the container will not retain this unmapped supplementary group. I rely on the systemd unit file being configured correctly with SupplementaryGroups=0 and I don't attempt to set it up automatically with syscall.Setgroups() because "at the kernel level, user IDs and group IDs are a per-thread attribute" (man setgroups) and the way Golang uses threads make it difficult to predict which thread is going to be used to execute runc. glibc's setgroup() is a wrapper that changes the credentials for all threads but Golang does not use the glibc implementation.

Example of possible configuration: ``` [plugins] [plugins."io.containerd.grpc.v1.cri"] [plugins."io.containerd.grpc.v1.cri".node_wide_uid_mapping] container_id = 0 host_id = 300000 size = 65536 [plugins."io.containerd.grpc.v1.cri".node_wide_gid_mapping] container_id = 0 host_id = 300000 size = 65536 ```

Return an error if NODE_WIDE_REMAPPED is requested

mauriciovasquezbernal

Some minor comments. I think we could remove NODE_WIDE_REMAPPED everywhere on this PR as we removed that on the kubelet one.

alban added 9 commits June 15, 2020 14:16

Add new CRI API, GetRuntimeConfigInfo() and a new LinuxNamepsaceOptio…

1fc856f

…n, User Initial work by vikaschoudhary16 <vichoudh@redhat.com> on Kubernetes PR 64005.

Implement GetRuntimeConfigInfo()

6b1955c

With this patch, GetRuntimeConfigInfo() returns a config without user namespace: the only uid/gid mapping is the one from the root user namespace.

Fix typo on RunPodSandbox

cb14152

Statically define a userns config

e4c50b8

Implement sandbox with user namespaces

2abe26c

User namespaces: use the correct snapshotter

717f607

User namespace: restrict mount options for sandbox /dev/shm (tmpfs)

986c403

The sandbox container (aka "pause" container) has a tmpfs mount on /dev/shm. Bind mount it with nosuid, noexec, nodev because the mount would not be allowed in user namespaces otherwise.

User namespaces: use the correct snapshotter

bc3308f

rata reviewed Jun 18, 2020

View reviewed changes

Comment thread pkg/server/sandbox_run.go Outdated

rata reviewed Jun 18, 2020

View reviewed changes

Comment thread pkg/server/sandbox_run.go Outdated

alban mentioned this pull request Jun 18, 2020

User namespace kinvolk/kubernetes#3

Draft

mauriciovasquezbernal reviewed Jun 19, 2020

View reviewed changes

Comment thread pkg/netns/netns.go

alban mentioned this pull request Jun 19, 2020

Add user namespace support containerd/cri#790

Closed

alban force-pushed the alban/userns branch from 24b67d4 to a5983e6 Compare June 22, 2020 09:57

alban added 4 commits June 24, 2020 14:31

User namespaces: implement NamespaceMode NODE and POD

023ad1e

Return an error if NODE_WIDE_REMAPPED is requested

Tests for user namespaces

53f33c2

Fixup move userns options together

3a93079

mauriciovasquezbernal reviewed Jun 30, 2020

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[1.3] Implement user namespaces#1

[1.3] Implement user namespaces#1
alban wants to merge 14 commits intorelease/1.3from
alban/userns

alban commented Jun 18, 2020 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mauriciovasquezbernal left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

alban commented Jun 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mauriciovasquezbernal left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alban commented Jun 18, 2020 •

edited

Loading