The "Ultimate" Investigation of a Slow Container Startup: I Finally Found the Culprit!

Recently, while testing the replacement of dockershim with containerd internally, we observed that business containers took a long time to become ready after being started. The init containers finished within a second, but the main containers sometimes did not start executing until more than a minute later.
Looking at the kubelet logs, we found that right after the node booted, PLEG's relisting method, which normally runs once per second, was taking more than 30 seconds to complete. After a few minutes the phenomenon disappeared, and "Relisting" went back to running once per second as expected.
This suggests that something goes wrong right after the node starts up and delays container execution, but the problem resolves itself as the node keeps running, and the containers eventually all run normally:
kubelet_init[1680]: I0602 05:48:06.350695    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:07.352436    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:49.669827    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:50.672822    1679 generic.go:191] "GenericPLEG: Relisting"
kubelet_init[1680]: I0602 05:48:51.676857    1679 generic.go:191] "GenericPLEG: Relisting"

The containerd configuration (config.toml) is shown below:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[grpc]
  address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/opt/cni/bin"
  conf_dir = "/etc/cni/net.d"

NOTE: Containerd: 1.6.21, K8s: 1.19, Kernel: 5.15.0

01 Technical Background

dockershim and CRI

In May 2022, Kubernetes 1.24 was officially released. The most notable change in that release was the removal of the dockershim component from the kubelet.

Starting with that release, kubelet users choose containerd or CRI-O as the container runtime, and the containerd architecture has evolved accordingly. For more background, see "Containerd深度剖析-runtime篇".

PLEG

PLEG stands for Pod Lifecycle Event Generator. In Kubernetes' distributed architecture, every node runs a kubelet daemon that manages the containers on that node, continuously adjusting their actual state to match the state defined in the spec (eventual consistency).

Specifically, the kubelet has to react promptly to changes in two places:

  • the desired state defined in the Pod spec
  • the actual state of the container runtime
For Pod specs, the kubelet watches for changes from several sources (apiserver, files, and HTTP). For containers, the kubelet periodically polls the container runtime for the latest state of all containers. As the number of Pods and containers grows, this polling becomes expensive: the periodic bursts of concurrent requests cause high CPU usage spikes, hurt node performance, and reduce node reliability. To lower the cost of Pod management and improve kubelet performance and scalability, PLEG was introduced, improving on the old approach in two ways (a minimal sketch of the relist-and-compare idea follows this list):
  • do less unnecessary work during idle periods (e.g. when neither the Pod definition nor the container state has changed);
  • reduce the number of concurrent requests needed to fetch container state.
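To make the relist-and-compare idea concrete, here is a minimal, self-contained sketch. This is not kubelet's actual code (that is quoted later in this article); the container states and the listContainers stand-in are invented purely for illustration:

package main

import (
	"fmt"
	"time"
)

// snapshot maps container ID to its last observed state.
type snapshot map[string]string

// diff compares two snapshots and returns one event per state change,
// which is the core idea behind GenericPLEG's relist: no change, no event,
// no further work.
func diff(old, cur snapshot) []string {
	var events []string
	for id, state := range cur {
		if old[id] != state {
			events = append(events, fmt.Sprintf("%s -> %s", id, state))
		}
	}
	for id := range old {
		if _, ok := cur[id]; !ok {
			events = append(events, fmt.Sprintf("%s -> removed", id))
		}
	}
	return events
}

func main() {
	// polls stands in for successive CRI "list all containers" responses.
	polls := []snapshot{
		{"web": "created"},
		{"web": "running"},
		{"web": "running"}, // no change: no events emitted
	}
	previous := snapshot{}
	for _, cur := range polls {
		for _, e := range diff(previous, cur) {
			fmt.Println("event:", e)
		}
		previous = cur
		time.Sleep(10 * time.Millisecond) // the real kubelet relists roughly once per second
	}
}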

Note: to reduce this overhead even further, the community has introduced an event-based (evented) PLEG, which of course requires support from the CRI runtime.

ImagePull

ctr image pull goes through the following steps:
  1. Resolve the image the user wants to download.

  2. Pull the image from the registry, saving the layer contents and the config into the content service and the image metadata into the images metadata service.

  3. Unpack the layers into the snapshot service.

Note: the content and images services mentioned here are gRPC services provided by containerd. While unpacking each layer, containerd temporarily mounts and then unmounts all of the parent snapshots.
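To see these steps from the client side, here is a minimal sketch using containerd's Go client. Assumptions in this sketch: containerd is listening on the default socket /run/containerd/containerd.sock, the k8s.io namespace is used, and the image reference is just an example:

package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect to the local containerd daemon (default socket assumed).
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// The CRI plugin keeps its images in the "k8s.io" namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Pull resolves the reference, fetches layers and config into the content
	// store, registers metadata in the images store, and (because of
	// WithPullUnpack) unpacks the layers into the snapshotter -- the three
	// steps described above.
	image, err := client.Pull(ctx, "docker.io/library/nginx:latest", containerd.WithPullUnpack)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("pulled and unpacked %s", image.Name())
}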

02 Troubleshooting

With the background above in mind, we know that "GenericPLEG: Relisting" calls containerd over CRI (Container Runtime Interface) to list all running containers.
// relist queries the container runtime for list of pods/containers, compare
// with the internal pods/containers, and generates events accordingly.
func (g *GenericPLEG) relist() {
  klog.V(5).Infof("GenericPLEG: Relisting")

  if lastRelistTime := g.getRelistTime(); !lastRelistTime.IsZero() {
    metrics.PLEGRelistInterval.Observe(metrics.SinceInSeconds(lastRelistTime))
  }

  timestamp := g.clock.Now()
  defer func() {
    metrics.PLEGRelistDuration.Observe(metrics.SinceInSeconds(timestamp))
  }()

  // Get all the pods.
  podList, err := g.runtime.GetPods(true)
Correspondingly, we checked the containerd logs, which indicate that containerd itself, or the shim, was having trouble stopping a container:
containerd[2206]: {"error":"failed to stop container: failed to delete task: context deadline exceeded: unknown","level":"error","msg":"failed to handle container TaskExit event \u0026TaskExit{ContainerID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,ID:8f71eadd772ffde7f9537998844aa3f7a8a4f15ec182a5943ef0d60f242b1929,Pid:3658,ExitStatus:0,ExitedAt:2023-06-15 06:39:36.566238047 +0000 UTC,XXX_unrecognized:[],}","time":"2023-06-15T06:39:46.568049942Z"}
So we captured a goroutine dump, in which one goroutine is waiting on a Delete call:
goroutine 1654 [select]:
github.com/containerd/ttrpc.(*Client).dispatch(0xc001ae1780, {0x56190174ac08?, 0xc000ae0180}, 0xc000ae05a0, 0xc0001183c0)
  /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:163 +0x2ad
github.com/containerd/ttrpc.defaultClientInterceptor({0x56190174ac08?, 0xc000ae0180?}, 0x5619023eb940?, 0xc000f982a0?, 0x6993c8a9f?, 0x5619023eb940?)
  /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/interceptor.go:49 +0x26
github.com/containerd/ttrpc.(*Client).Call(0xc001ae1780, {0x56190174ac08, 0xc000ae0180}, {0x561900ea8b7f, 0x17}, {0x561900e8851e, 0x6}, {0x56190167d8a0?, 0xc000e6a880?}, {0x56190167d9e0, ...})
  /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:132 +0x3ac
github.com/containerd/containerd/runtime/v2/task.(*taskClient).Delete(0xc00011a578, {0x56190174ac08, 0xc000ae0180}, 0x5618fff5830f?)
  /go/src/github.com/containerd/containerd/runtime/v2/task/shim.pb.go:3609 +0x98
github.com/containerd/containerd/runtime/v2.(*shimTask).delete(0xc00165e870, {0x56190174ac08, 0xc000ae0180}, 0xc0009ab988)
  /go/src/github.com/containerd/containerd/runtime/v2/shim.go:275 +0xc2
github.com/containerd/containerd/runtime/v2.(*TaskManager).Delete(0xc00040a0f8, {0x56190174ac08, 0xc000ae0180}, {0xc0013aaa40?, 0x40?})
  /go/src/github.com/containerd/containerd/runtime/v2/manager.go:419 +0x8c
github.com/containerd/containerd/services/tasks.(*local).Delete(0xc00070a5a0, {0x56190174ac08, 0xc000ae0180}, 0xc000d7cd50, {0x561901620fe0?, 0xc0009abab8?, 0x5618ffbcf13f?})
  /go/src/github.com/containerd/containerd/services/tasks/local.go:291 +0x11a
github.com/containerd/containerd.(*task).Delete(0xc0006ba440, {0x56190174ac08, 0xc000ae0180}, {0xc000d08570, 0x2, 0x56190174b1f0?})
  /go/src/github.com/containerd/containerd/task.go:332 +0x355
github.com/containerd/containerd/pkg/cri/server.handleContainerExit({0x56190174ac08, 0xc000ae0180}, 0xc000ae0120, {{{0xc0013aaa40, 0x40}, {0xc000dfd7a0, 0x52}, {0xc001a8e240, 0x40}, 0xc001240380, ...}, ...})
  /go/src/github.com/containerd/containerd/pkg/cri/server/events.go:387 +0x202
github.com/containerd/containerd/pkg/cri/server.(*eventMonitor).startContainerExitMonitor.func1.1(0xc000089240, 0xc000ae0120)
  /go/src/github.com/containerd/containerd/pkg/cri/server/events.go:190 +0x310
github.com/containerd/containerd/pkg/cri/server.(*eventMonitor).startContainerExitMonitor.func1()
  /go/src/github.com/containerd/containerd/pkg/cri/server/events.go:198 +0x37f
created by github.com/containerd/containerd/pkg/cri/server.(*eventMonitor).startContainerExitMonitor
  /go/src/github.com/containerd/containerd/pkg/cri/server/events.go:162 +0x12b
In containerd-shim-runc-v2, another goroutine is stuck in the umount syscall:
goroutine 104 [syscall]:
syscall.Syscall(0x7c?, 0xc000161800?, 0xc0000c2200?, 0x7d?)
        /usr/local/go/src/syscall/syscall_linux.go:68 +0x27
golang.org/x/sys/unix.Unmount({0xc0000c2200?, 0x0?}, 0xc00029c448?)
        /go/src/github.com/containerd/containerd/vendor/golang.org/x/sys/unix/zsyscall_linux.go:1680 +0x7d
github.com/containerd/containerd/mount.unmount({0xc0000c2200, 0x7c}, 0x9ed280?)
        /go/src/github.com/containerd/containerd/mount/mount_linux.go:155 +0x8c
github.com/containerd/containerd/mount.UnmountAll({0xc0000c2200, 0x7c}, 0xc000065080?)
        /go/src/github.com/containerd/containerd/mount/mount_linux.go:185 +0x8c
github.com/containerd/containerd/pkg/process.(*Init).delete(0xc0000d2c60, {0x9f1460, 0xc000065080})
        /go/src/github.com/containerd/containerd/pkg/process/init.go:312 +0x13a
github.com/containerd/containerd/pkg/process.(*stoppedState).Delete(0xc000012ab8, {0x9f1460?, 0xc000065080?})
        /go/src/github.com/containerd/containerd/pkg/process/init_state.go:396 +0x2e
github.com/containerd/containerd/pkg/process.(*Init).Delete(0xc00002d180?, {0x9f1460?, 0xc000065080?})
        /go/src/github.com/containerd/containerd/pkg/process/init.go:288 +0xb6
github.com/containerd/containerd/runtime/v2/runc.(*Container).Delete(0xc000132380?, {0x9f1460, 0xc000065080}, 0xc0003071c0)
        /go/src/github.com/containerd/containerd/runtime/v2/runc/container.go:390 +0x62
github.com/containerd/containerd/runtime/v2/runc/task.(*service).Delete(0xc000132380, {0x9f1460, 0xc000065080}, 0xc0003071c0)
        /go/src/github.com/containerd/containerd/runtime/v2/runc/task/service.go:216 +0x75
github.com/containerd/containerd/runtime/v2/task.RegisterTaskService.func4({0x9f1460, 0xc000065080}, 0xc00002ee00)
        /go/src/github.com/containerd/containerd/runtime/v2/task/shim.pb.go:3477 +0x98
github.com/containerd/ttrpc.defaultServerInterceptor({0x9f1460?, 0xc000065080?}, 0x17?, 0xc000134760?, 0x6?)
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/interceptor.go:45 +0x26
github.com/containerd/ttrpc.(*serviceSet).dispatch(0xc000118650, {0x9f1460, 0xc000065080}, {0xc00002aba0, 0x17}, {0xc00027e22a, 0x6}, {0xc000180280, 0x42, 0x50})
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:95 +0x1be
github.com/containerd/ttrpc.(*serviceSet).call(0xd0e330?, {0x9f1460?, 0xc000065080?}, {0xc00002aba0?, 0x84eac0?}, {0xc00027e22a?, 0xc000308000?}, {0xc000180280, 0x42, 0x50})
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/services.go:64 +0x71
github.com/containerd/ttrpc.(*serverConn).run.func2(0x1b)
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:439 +0xe5
created by github.com/containerd/ttrpc.(*serverConn).run
        /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/server.go:435 +0x6b1
At this point we can go back to the containerd log, where an UpdateContainerResources request can be seen waiting for the bbolt lock:
goroutine 1723 [semacquire]:
sync.runtime_SemacquireMutex(0xc001a78300?, 0x9?, 0xc001500c00?)
  /usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc0002f9178)
  /usr/local/go/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
  /usr/local/go/src/sync/mutex.go:90
go.etcd.io/bbolt.(*DB).beginRWTx(0xc0002f8fc0)
  /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:640 +0x77
go.etcd.io/bbolt.(*DB).Begin(0xc001739a70?, 0x87?)
  /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:589 +0x1d
go.etcd.io/bbolt.(*DB).Update(0x20?, 0xc000b44be0)
  /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:725 +0x3e
github.com/containerd/containerd/metadata.(*DB).Update(0xc0001e92d0, 0x56190174ac40?)
  /go/src/github.com/containerd/containerd/metadata/db.go:246 +0x9e
github.com/containerd/containerd/services/containers.(*local).withStoreUpdate(0xc0003e8840, {0x56190174ac40?, 0xc0017390b0?}, 0x0?)
  /go/src/github.com/containerd/containerd/services/containers/local.go:212 +0x3c
github.com/containerd/containerd/services/containers.(*local).Update(0xc0003e8840, {0x56190174ac40, 0xc0017390b0}, 0xc000eb0540, {0xedc1ca928?, 0x0?, 0xc0017399b0?})
  /go/src/github.com/containerd/containerd/services/containers/local.go:156 +0x24e
github.com/containerd/containerd.(*remoteContainers).Update(_, {_, _}, {{0xc00136b280, 0x40}, 0xc0017398f0, {0xc0016cc280, 0x96}, {{0xc00145a9f0, 0x15}, ...}, ...}, ...)
  /go/src/github.com/containerd/containerd/containerstore.go:129 +0x2ce
github.com/containerd/containerd.(*container).Update(0xc0006e4a80, {0x56190174ac40, 0xc0017390b0}, {0xc00040a0f0, 0x1, 0xc0016f0101?})
  /go/src/github.com/containerd/containerd/container.go:318 +0x214
github.com/containerd/containerd/pkg/cri/server.updateContainerSpec({0x56190174ac40, 0xc0017390b0}, {0x561901754a58, 0xc0006e4a80}, 0xc0016f4b80)

This part of the code lives in:

containerd/pkg/cri/server/container_update_resources.go

func (c *criService) UpdateContainerResources(ctx context.Context, r *runtime.UpdateContainerResourcesRequest) (retRes *runtime.UpdateContainerResourcesResponse, retErr error) {
    container, err := c.containerStore.Get(r.GetContainerId())
    if err != nil {
      return nil, fmt.Errorf("failed to find container: %w", err)
    }
    // Update resources in status update transaction, so that:
    // 1) There won't be race condition with container start.
    // 2) There won't be concurrent resource update to the same container.
    if err := container.Status.UpdateSync(func(status containerstore.Status) (containerstore.Status, error) {
      return c.updateContainerResources(ctx, container, r, status)
    }); err != nil {
      return nil, fmt.Errorf("failed to update resources: %w", err)
    }
    return &runtime.UpdateContainerResourcesResponse{}, nil
}
UpdateContainerResources holds the container status lock, and ListContainers needs that same lock, which is what makes PLEG (Pod Lifecycle Event Generator) take so long:
goroutine 1665 [semacquire]:
sync.runtime_SemacquireMutex(0xe69?, 0xae?, 0x1768c1d3f53d5f79?)
  /usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*RWMutex).RLock(...)
  /usr/local/go/src/sync/rwmutex.go:71
github.com/containerd/containerd/pkg/cri/store/container.(*statusStorage).Get(0xc00058a880?)
  /go/src/github.com/containerd/containerd/pkg/cri/store/container/status.go:205 +0x9c
github.com/containerd/containerd/pkg/cri/server.toCRIContainer({{{0xc00136b280, 0x40}, {0xc00186b6d0, 0x4d}, {0xc001a8e6c0, 0x40}, 0xc0007129a0, {0xc00050eb40, 0x47}, {0xc000dfcd80, ...}, ...}, ...})
  /go/src/github.com/containerd/containerd/pkg/cri/server/container_list.go:48 +0x3f
github.com/containerd/containerd/pkg/cri/server.(*criService).ListContainers(0xc000548400, {0xc0006027b0?, 0x561900e89376?}, 0xc000ef6130)
  /go/src/github.com/containerd/containerd/pkg/cri/server/container_list.go:37 +0x205
github.com/containerd/containerd/pkg/cri/server.(*instrumentedService).ListContainers(0xc000708168, {0x56190174ac40, 0xc0006027b0}, 0xc000ef6130)
  /go/src/github.com/containerd/containerd/pkg/cri/server/instrumented_service.go:515 +0x1d3
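To make the lock interaction concrete, here is a toy sketch (not containerd's code; the names statusLock and slowDBUpdate and the timings are invented purely for illustration). A writer that holds a status lock across a slow database commit stalls every reader, which is exactly the shape of the UpdateSync / ListContainers interaction above:

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var statusLock sync.RWMutex // stands in for the per-container status RWMutex

	// slowDBUpdate stands in for a bbolt Update that is stuck behind fsync
	// on a busy disk.
	slowDBUpdate := func() { time.Sleep(2 * time.Second) }

	// Writer: like UpdateSync, it holds the write lock across the DB update.
	go func() {
		statusLock.Lock()
		defer statusLock.Unlock()
		slowDBUpdate()
	}()

	time.Sleep(100 * time.Millisecond) // let the writer grab the lock first

	// Reader: like ListContainers -> status.Get, it only needs a read lock,
	// but it still has to wait for the whole slow write to finish.
	start := time.Now()
	statusLock.RLock()
	statusLock.RUnlock()
	fmt.Printf("ListContainers-style read blocked for %v\n", time.Since(start))
}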

The bbolt lock, in turn, is held by a request that is busy syncing data to storage, so we suspected that the host was under I/O pressure. With cgroup v2 enabled, this can be confirmed by looking at the io pressure (PSI) files, which we will cover in a later article, or with the iostat command:

goroutine 1268 [syscall]:
syscall.Syscall(0x0?, 0x0?, 0xc000984000?, 0x686?)
  /usr/local/go/src/syscall/syscall_linux.go:68 +0x27
syscall.Fsync(0x0?)
  /usr/local/go/src/syscall/zsyscall_linux_amd64.go:423 +0x2a
internal/poll.(*FD).Fsync.func1(...)
  /usr/local/go/src/internal/poll/fd_fsync_posix.go:18
internal/poll.ignoringEINTR(...)
  /usr/local/go/src/internal/poll/fd_posix.go:74
internal/poll.(*FD).Fsync(0x0?)
  /usr/local/go/src/internal/poll/fd_fsync_posix.go:17 +0xf4
os.(*File).Sync(0xc00011a6b8)
  /usr/local/go/src/os/file_posix.go:168 +0x4e
github.com/containerd/containerd/content/local.(*writer).Commit(0xc0008c7880, {0x56190174ac40, 0xc0006028a0}, 0x181ebe95, {0xc001476780, 0x47}, {0x0, 0x0, 0x5619023eb940?})
  /go/src/github.com/containerd/containerd/content/local/writer.go:95 +0x24e
github.com/containerd/containerd/metadata.(*namespacedWriter).commit(0xc0006f5320, {0x56190174ac40, 0xc0006028a0}, 0x0?, 0x181ebe95, {0xc001476780, 0x47}, {0x0, 0x0, 0x0})
  /go/src/github.com/containerd/containerd/metadata/content.go:641 +0x5bf
github.com/containerd/containerd/metadata.(*namespacedWriter).Commit.func1(0x0?)
  /go/src/github.com/containerd/containerd/metadata/content.go:578 +0x8f
go.etcd.io/bbolt.(*DB).Update(0x0?, 0xc000bb01e0)
  /go/src/github.com/containerd/containerd/vendor/go.etcd.io/bbolt/db.go:741 +0x82
github.com/containerd/containerd/metadata.(*DB).Update(0xc0001e92d0, 0x5619014f3000?)
  /go/src/github.com/containerd/containerd/metadata/db.go:246 +0x9e
github.com/containerd/containerd/metadata.update({0x56190174ac40?, 0xc0006028a0?}, {0x561901738b90, 0xc0001e92d0}, 0xc000
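As a quick way to check the I/O pressure mentioned above without extra tooling, here is a minimal sketch that reads the system-wide PSI file (it assumes a kernel with PSI enabled; on cgroup v2 each cgroup also exposes an io.pressure file under /sys/fs/cgroup with the same format):

package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// /proc/pressure/io is the system-wide PSI file; its lines look like
	// "some avg10=12.34 avg60=5.67 avg300=1.23 total=123456789".
	data, err := os.ReadFile("/proc/pressure/io")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(string(data))
}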

To validate this theory, we switched to a different node, removed dockershim, and rebooted. After the reboot we saw a flood of image pulls (containerd's image store is separate from Docker's, so the kubelet had to pull images again when starting containers), and iostat showed very high I/O. What was the cause?

The frequent image pulls and unpacks were causing disk contention. As described in the ImagePull background above, the pull itself does not force all dirty pages out to disk, but the umount that follows does.
How do we fix this?
We just need to avoid the forced flush on umount; then, no matter how many images are being pulled, the symptom is greatly alleviated.

03 The Fix

We found the corresponding fix in the community: PR #8676 adds a mount option to the overlay snapshotter plugin so that volatile can be used. This feature requires overlayfs support in the kernel: https://docs.kernel.org/filesystems/overlayfs.html#volatile-mount

Note: volatile skips sync calls, so umount no longer stalls while flushing data to disk.
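To illustrate what the volatile option means at the mount level, here is a sketch of the overlayfs behavior the PR enables (this is not the snapshotter code itself; the lower/upper/work paths are made up and would need to exist on a filesystem that supports overlayfs, with root privileges):

package main

import (
	"log"

	"github.com/containerd/containerd/mount"
)

func main() {
	// Hypothetical overlay layers -- replace with real, existing directories.
	m := mount.Mount{
		Type:   "overlay",
		Source: "overlay",
		Options: []string{
			"lowerdir=/var/lib/example/lower",
			"upperdir=/var/lib/example/upper",
			"workdir=/var/lib/example/work",
			// With "volatile", overlayfs skips sync calls for this mount, so
			// the unmount below no longer forces dirty pages out to disk.
			"volatile",
		},
	}

	target := "/mnt/example"
	if err := mount.All([]mount.Mount{m}, target); err != nil {
		log.Fatal(err)
	}

	// UnmountAll is the same helper seen in the shim stack trace above;
	// with a volatile overlay it returns without a long sync stall.
	if err := mount.UnmountAll(target, 0); err != nil {
		log.Fatal(err)
	}
}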

Due to the author's limited time, perspective, and knowledge, this article inevitably contains errors and omissions; corrections and feedback from readers and industry experts are very welcome. The troubleshooting details above have been adapted to upstream/community content.
Source: republished from the WeChat public account DCOS.
