What is the difference between Docker's Cgroup Drivers?

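Let's start with systemd. A few years ago the major distributions began switching to systemd, which is itself tightly integrated with cgroups. In fact, every service unit is assigned its own cgroup when it starts, as shown below: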

systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-03-29 19:20:12 CST; 5min ago
     Docs: https://docs.docker.com
 Main PID: 3512 (dockerd)
    Tasks: 39
   Memory: 56.4M
   CGroup: /system.slice/docker.service    # this path is listed below
           ├─3512 /usr/bin/dockerd
           ├─3526 docker-containerd --config /var/run/docker/containerd/containerd.toml
           └─3725 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f -address /var/run/docker/conta...
ls /sys/fs/cgroup/systemd/system.slice/docker.service/
cgroup.clone_children  cgroup.event_control  cgroup.procs  notify_on_release  tasks
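
The same information is available per process in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal Python sketch, not something from the original setup; the helper names parse_proc_cgroup and cgroup_of are made up for illustration.

```python
from pathlib import Path


def parse_proc_cgroup(text: str) -> dict:
    """Map each controller list (e.g. 'name=systemd') to its cgroup path."""
    result = {}
    for line in text.splitlines():
        # Each line looks like "hierarchy-ID:controller-list:cgroup-path".
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        result[controllers] = path
    return result


def cgroup_of(pid="self") -> dict:
    """Read and parse /proc/<pid>/cgroup (hypothetical helper, Linux only)."""
    return parse_proc_cgroup(Path(f"/proc/{pid}/cgroup").read_text())


if __name__ == "__main__":
    # On cgroup v2 the controller list is empty, so label it explicitly.
    for controllers, path in cgroup_of().items():
        print(f"{controllers or '(unified)'}\t{path}")
```

Passing the Main PID of dockerd instead of "self" should show the docker.service path from the status output, assuming the process is still running.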

Next, run a container with docker run -d nginx:alpine, then switch native.cgroupdriver between systemd and cgroupfs and observe the difference. The commands used here are systemd-cgls and systemd-cgtop.
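
One common way to set this option is the exec-opts key of the daemon configuration file. The sketch below only builds the JSON; it assumes the file lives at /etc/docker/daemon.json and that you restart dockerd yourself afterwards, neither of which is spelled out in this post. The function name with_cgroup_driver is hypothetical.

```python
import json


def with_cgroup_driver(config: dict, driver: str) -> dict:
    """Return a copy of a daemon.json dict with native.cgroupdriver set."""
    if driver not in ("systemd", "cgroupfs"):
        raise ValueError(f"unknown cgroup driver: {driver!r}")
    updated = dict(config)
    # Keep unrelated exec-opts, but drop any previous cgroup driver entry.
    opts = [
        opt
        for opt in config.get("exec-opts", [])
        if not opt.startswith("native.cgroupdriver=")
    ]
    opts.append(f"native.cgroupdriver={driver}")
    updated["exec-opts"] = opts
    return updated


if __name__ == "__main__":
    # Print the JSON that would go into the (assumed) /etc/docker/daemon.json.
    print(json.dumps(with_cgroup_driver({}, "systemd"), indent=2))
```

After the daemon restarts, docker info reports the active driver on its "Cgroup Driver" line, which is where the two headings below come from.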

Cgroup Driver: systemd

systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
├─user.slice
│ └─user-0.slice
│   └─session-622.scope
│     ├─ 570 sshd: root@pts/0
│     ├─ 573 -bash
│     ├─2027 systemd-cgls
│     └─2028 less
└─system.slice
  ├─docker-b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f.scope
  │ ├─1904 nginx: master process nginx -g daemon off
  │ └─1974 nginx: worker proces
  ├─docker.service
  │ ├─1592 /usr/bin/dockerd
  │ ├─1600 docker-containerd --config /var/run/docker/containerd/containerd.toml
  │ └─1846 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f -address /var/run/docker/containerd/dock
ls /sys/fs/cgroup/systemd/system.slice/var-lib-docker-containers-b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f-mounts-shm.mount/
cgroup.clone_children  cgroup.event_control  cgroup.procs  notify_on_release  tasks

Cgroup Driver: cgroupfs

systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
├─docker
│ └─b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f
│   ├─2733 nginx: master process nginx -g daemon off
│   └─2805 nginx: worker proces
├─user.slice
│ └─user-0.slice
│   └─session-622.scope
│     ├─ 570 sshd: root@pts/0
│     ├─ 573 -bash
│     ├─2909 systemd-cgls
│     └─2911 less
└─system.slice
  ├─docker.service
  │ ├─2474 /usr/bin/dockerd
  │ ├─2482 docker-containerd --config /var/run/docker/containerd/containerd.toml
  │ └─2714 docker-containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f -address /var/run/docker/containerd/dock
ls /sys/fs/cgroup/systemd/docker/b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f/
cgroup.clone_children  cgroup.event_control  cgroup.procs  notify_on_release  tasks

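As you can see, the main difference is where the container's cgroup lives: with the systemd driver it sits under system.slice as a docker-<id>.scope unit, whereas with cgroupfs it is split out into a separate top-level docker group.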
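
To make the two layouts concrete, here is a small Python sketch that derives the expected cgroup path from a container ID. It only mirrors the two listings above; other Docker versions or a custom cgroup parent may produce different paths, and the function name container_cgroup_path is made up for this post.

```python
def container_cgroup_path(container_id: str, driver: str) -> str:
    """Expected cgroup path of a container, based on the listings above."""
    if driver == "systemd":
        # systemd driver: a transient scope unit under system.slice.
        return f"/system.slice/docker-{container_id}.scope"
    if driver == "cgroupfs":
        # cgroupfs driver: a plain directory under a top-level docker group.
        return f"/docker/{container_id}"
    raise ValueError(f"unknown cgroup driver: {driver!r}")


if __name__ == "__main__":
    cid = "b81788c8d662203bcd46acadf1d0d0c3ed9dfe76f15748585e040750d6cb683f"
    for driver in ("systemd", "cgroupfs"):
        print(driver, container_cgroup_path(cid, driver))
```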

OK, so what are the pros and cons of each? The following is quoted from https://github.com/coreos/bugs/issues/1435

Reasons to use cgroupfs

  • Better tested with docker (and related projects like k8s, cadvisor) due to being the default upstream driver ⭐️⭐️
  • Doesn’t mess up handling of init.scope in systemd 226+ (current issue in CoreOS alpha/beta) (runc issue, cadvisor issue)
  • First-supported for some Kubernetes features (e.g. pod-level cgroups)
  • Jessie Frazelle’s endorsement 🌟

(important reasons starred)

Reasons to use systemd

  • Backwards compatibility with the cgroup layout
  • Better integration with some tools (are there any worth noting? I’m specifically thinking of systemctl status but I’m not sure that actually does integrate better)
  • Belief that having systemd manage cgroups is fundamentally better than having dockerd/runc do so

How do you list all namespaces and cgroups on the host?

lsns    # from the util-linux package
        NS TYPE NPROCS   PID USER   COMMAND
4026531836 pid      96     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026531837 user     98     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026531838 uts      96     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026531839 ipc      96     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026531840 mnt      94     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026531856 mnt       1    28 root   kdevtmpfs
4026531956 net      96     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 22
4026532165 mnt       1  2847 chrony /usr/sbin/chronyd
4026532173 mnt       2  3773 root   nginx: master process nginx -g daemon off
4026532174 uts       2  3773 root   nginx: master process nginx -g daemon off
4026532175 ipc       2  3773 root   nginx: master process nginx -g daemon off
4026532176 pid       2  3773 root   nginx: master process nginx -g daemon off
4026532178 net       2  3773 root   nginx: master process nginx -g daemon off
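
The NS column above is the inode number behind the symlinks in /proc/<pid>/ns, whose targets look like net:[4026531956]. As a rough illustration of where lsns gets its data, the Python sketch below reads those links for a single process. The helper names are hypothetical, and reading another user's process usually requires root.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "net:[4026531956]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str):
    """Split a namespace link target into (type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self") -> dict:
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _ns_type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = inode
    return result


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{inode} {name}")
```

Comparing the result for PID 1 with the result for the nginx master process should show different inodes for mnt, uts, ipc, pid and net, matching the table.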

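ip netns list works too, in principle. However, Docker keeps every container's network namespace under /var/run/docker/netns/, and ip netns does not read that directory. As the strace output below shows, it looks in /var/run/netns, so you need to create symlinks there.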

openat(AT_FDCWD, "/var/run/netns", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
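
A minimal sketch of that symlink step in Python, assuming the two default paths mentioned above and root privileges. The function name link_docker_netns is made up, and I have only sketched the linking itself, not verified it against every iproute2 version.

```python
import os


def link_docker_netns(docker_dir="/var/run/docker/netns", ip_dir="/var/run/netns"):
    """Symlink every Docker netns file into the directory ip netns reads.

    Returns the names that were newly linked; existing entries are left alone.
    """
    os.makedirs(ip_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(docker_dir)):
        dest = os.path.join(ip_dir, name)
        if os.path.lexists(dest):
            continue  # do not overwrite anything that is already there
        os.symlink(os.path.join(docker_dir, name), dest)
        linked.append(name)
    return linked


if __name__ == "__main__":
    for name in link_docker_netns():
        print("linked", name)
```

Afterwards ip netns list should be able to find the entries. Links to containers that have since exited become dangling and need to be cleaned up by hand.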

Why can Ubuntu run a CentOS container?

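Before answering, it is worth asking what the fundamental difference between Ubuntu and CentOS actually is. Many people would say that one uses apt and the other uses yum. But if you copy Ubuntu's ls binary onto CentOS, will it run and produce output?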

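The answer is yes. Running strace ls and strace ubuntuls shows which operations each ls performs: both simply issue system calls to the kernel. A CentOS container likewise just calls the host kernel's interfaces. In my tests gVisor also ended up on the host's kernel, although it first intercepts the container's system calls in a user-space kernel of its own.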

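In other words, a Docker image merely packages up all of the dependencies; at run time it still calls the host system's interfaces. You would only notice a difference if the corresponding kernel interface changed, and someone like me who works on application-level or framework code almost never does.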
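
A quick way to see this for yourself is to print the userland distribution next to the kernel release, once on the host and once inside the container. The Python sketch below assumes the image ships /etc/os-release and a Python interpreter, which minimal images may not; the helper names are hypothetical.

```python
import os


def parse_os_release(text: str) -> dict:
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def describe_environment(os_release_path="/etc/os-release") -> str:
    """Report the userland distro ID and the running kernel release."""
    with open(os_release_path) as handle:
        distro = parse_os_release(handle.read()).get("ID", "unknown")
    return f"userland={distro} kernel={os.uname().release}"


if __name__ == "__main__":
    print(describe_environment())
```

With the default runtime, the userland value should differ between an Ubuntu host and a CentOS container while the kernel value stays the same, because the container has no kernel of its own.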