Kubernetes Pods cannot be deleted: glusterfs makes docker unresponsive and the cluster fails in a cascade

Tags: kubernetes_problem 

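Conclusion

The investigation of this problem was quite convoluted; only the part about the Pod that could not be deleted is recorded here. In the end, all three of the following problems turned out to be caused by containers that mounted glusterfs:

  1. Pods cannot be deleted
  2. docker ps does not respond
  3. kubelet misbehaves and nodes suddenly become unavailable; the failure spreads like an avalanche, taking down dozens of nodes within ten-odd minutes

For legacy reasons, a number of containers mounted glusterfs. Some of these containers suddenly failed and were rescheduled, and whichever node they landed on then crashed in turn. At the same time, some of these containers were stuck in the Terminating state and could not be deleted.

Analysis of the logs on the failed nodes showed that, when creating a Pod, an unavailable node first failed to create the container that needed to mount glusterfs; docker ps then stopped responding, and finally the node became NotReady. After glusterfs was unmounted (its process killed), the containers that could not be deleted were removed successfully and docker ps responded again.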
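
Since the root cause was glusterfs mounts, a useful first check on a suspect node is to see which glusterfs mounts it currently holds. The following is a minimal sketch, not the procedure used in the original investigation. It assumes the volumes are mounted through the glusterfs FUSE client and therefore appear with filesystem type fuse.glusterfs in /proc/mounts; the function name list_glusterfs_mounts is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the mount point of every glusterfs FUSE mount in a mounts table.
# $1 is the table to read and defaults to /proc/mounts, whose fields are:
# device, mount point, filesystem type, options, ...
# Assumption: glusterfs is mounted via FUSE (type "fuse.glusterfs").
list_glusterfs_mounts() {
    local mounts_file="${1:-/proc/mounts}"
    awk '$3 == "fuse.glusterfs" { print $2 }' "$mounts_file"
}

# Example (on the node):
#   list_glusterfs_mounts
```

The optional argument only exists so that the function can also be run against a saved copy of the mounts table from a node that is no longer reachable.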

Symptoms

Pod deletion fails and the Pod stays in the Terminating state. The describe output is as follows:

Events:
  Type     Reason                 Age                From                      Message
  ----     ------                 ----               ----                      -------
  Normal   Killing                11m (x35 over 1h)  kubelet, kube-cluster-node-xxx  Killing container with id docker://xxxx-xxxxx-index-task:Need to kill Pod
  Normal   SuccessfulMountVolume  8m                 kubelet, kube-cluster-node-xxx  MountVolume.SetUp succeeded for volume "xxxx-xxxxx-index-task-volume-log"
  Normal   SuccessfulMountVolume  8m                 kubelet, kube-cluster-node-xxx  MountVolume.SetUp succeeded for volume "xxxx-xxxxx-index-task-volume-custom"
  Normal   SuccessfulMountVolume  8m                 kubelet, kube-cluster-node-xxx  MountVolume.SetUp succeeded for volume "default-token-xxt6c"
  Normal   Killing                9s (x4 over 6m)    kubelet, kube-cluster-node-xxx  Killing container with id docker://xxxx-xxxxx-index-task:Need to kill Pod
  Warning  FailedKillPod          9s (x4 over 6m)    kubelet, kube-cluster-node-xxx  error killing pod: failed to "KillContainer" for "xxxx-xxxxx-index-task" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
  Warning  FailedPreStopHook      8s (x5 over 8m)    kubelet, kube-cluster-node-xxx  Exec lifecycle hook ([/bin/sleep 30]) for Container "xxxx-xxxxx-index-task" in Pod "xxxx-xxxxx-index-task-569f775985-2qv8x_xxx-xxxx(2b8e5882-5146-11e9-b3b3-525400dd6f19)" failed - error: command '/bin/sleep 30' exited with 126: , message: "rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:240: creating new parent process caused \"container_linux.go:1254: running lstat on namespace path \\\"/proc/30849/ns/ipc\\\" caused \\\"lstat /proc/30849/ns/ipc: no such file or directory\\\"\"\n\r\n"
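
To see how many Pods are stuck in this way across the cluster, the kubectl Pod listing can be filtered by status. This is a minimal sketch under the assumption that the input is the default output of kubectl get pods --all-namespaces --no-headers, where the fourth column is STATUS; the function name filter_terminating is illustrative.

```shell
#!/usr/bin/env bash
# Read a pod listing on stdin and print "namespace/name" for every pod
# whose STATUS column is Terminating.
# Assumption: columns are NAMESPACE NAME READY STATUS RESTARTS AGE, as in
# `kubectl get pods --all-namespaces --no-headers`.
filter_terminating() {
    awk '$4 == "Terminating" { print $1 "/" $2 }'
}

# Example:
#   kubectl get pods --all-namespaces --no-headers | filter_terminating
```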

Investigation

The kubelet log shows that the deletion failed and that the preStop command failed, consistent with the describe output:

Apr 08 13:33:44 kube-cluster-node-xxx kubelet[27666]: E0408 13:33:44.466398   27666 remote_runtime.go:229] StopContainer "17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b" from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Apr 08 13:33:44 kube-cluster-node-xxx kubelet[27666]: E0408 13:33:44.466507   27666 kuberuntime_container.go:604] Container "docker://17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b" termination failed with gracePeriod 30: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Apr 08 13:33:44 kube-cluster-node-xxx kubelet[27666]: E0408 13:33:44.468600   27666 kubelet.go:1532] error killing pod: failed to "KillContainer" for "xxxx-xxxxx-index-task" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Apr 08 13:33:44 kube-cluster-node-xxx kubelet[27666]: E0408 13:33:44.468641   27666 pod_workers.go:186] Error syncing pod 2b8e5882-5146-11e9-b3b3-525400dd6f19 ("xxxx-xxxxx-index-task-569f775985-2qv8x_xxx-xxxx(2b8e5882-5146-11e9-b3b3-525400dd6f19)"), skipping: error killing pod: failed to "KillContainer" for "xxxx-xxxxx-index-task" with KillContainerError: "rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Apr 08 13:33:45 kube-cluster-node-xxx kubelet[27666]: E0408 13:33:45.099777   27666 kuberuntime_container.go:495] preStop hook for container "xxxx-xxxxx-index-task" failed: command '/bin/sleep 30' exited with 126:

The Docker log shows that the exec command failed:

Apr 08 13:31:09 kube-cluster-node-xxx dockerd[10628]: time="2019-04-08T13:31:09.609938386+08:00" level=info msg="Container 17619dcf545c failed to exit within 10 seconds of kill - trying direct SIGKILL"
Apr 08 13:31:44 kube-cluster-node-xxx dockerd[10628]: time="2019-04-08T13:31:44.464673647+08:00" level=error msg="Error running exec in container: rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:240: creating new parent process caused \"container_linux.go:1254: running lstat on namespace path \\\"/proc/30849/ns/ipc\\\" caused \\\"lstat /proc/30849/ns/ipc: no such file or directory\\\"\"\n"
Apr 08 13:32:14 kube-cluster-node-xxx dockerd[10628]: time="2019-04-08T13:32:14.466737962+08:00" level=info msg="Container 17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b failed to exit within 30 seconds of signal 15 - using the force"
Apr 08 13:32:24 kube-cluster-node-xxx dockerd[10628]: time="2019-04-08T13:32:24.467687487+08:00" level=info msg="Container 17619dcf545c failed to exit within 10 seconds of kill - trying direct SIGKILL"

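Observing the state of the container that cannot be deleted

Running docker rm -f 17619dcf54 hangs for a long time, and docker ps does not respond either. Trying to delete the container directly with docker-runc also fails: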

$ docker-runc   delete -f  17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b
kill container 17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b: container init still running
one or more of the container deletions failed

$ docker ps |grep 17619dc
17619dcf545c        xxxx.xxxx.com/hindex/xxxx-xxxxx-index-task      "./entrypoint.sh /..."   10 days ago         Up 10 days                              k8s_xxxx-xxxxx-index-task_xxxx-xxxxx-index-task-569f775985-2qv8x_xxx-xxxx_2b8e5882-5146-11e9-b3b3-525400dd6f19_0
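
Because commands such as docker ps may hang indefinitely on an affected node, it helps to probe them with a time limit instead of blocking the terminal. The sketch below assumes GNU coreutils timeout is available, which exits with status 124 when the limit is reached; the function name probe_with_timeout and the 10-second limit in the example are arbitrary choices, not values from the original investigation.

```shell
#!/usr/bin/env bash
# Run a command with a time limit and report how it ended.
# $1 is the limit in seconds; the remaining arguments are the command.
# Prints "ok" on success, "hung" if the limit was reached (GNU timeout
# exits with 124 in that case), and "failed" for any other exit status.
probe_with_timeout() {
    local limit="$1"
    shift
    local rc=0
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "ok" ;;
        124) echo "hung" ;;
        *)   echo "failed" ;;
    esac
}

# Example:
#   probe_with_timeout 10 docker ps
```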

Trying to enter the container that cannot be deleted fails, with output matching the Docker log:

$ docker exec -it 17619dcf545c /bin/sh
rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:240: creating new parent process caused "container_linux.go:1254: running lstat on namespace path \"/proc/30849/ns/ipc\" caused \"lstat /proc/30849/ns/ipc: no such file or directory\""

Using docker-runc does not work either:

$ docker-runc list  |grep 17619d
17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b   31238       running     /run/docker/libcontainerd/17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b   2019-03-28T10:42:35.848580074Z

$ docker-runc ps 17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b
UID        PID  PPID  C STIME TTY          TIME CMD
root     31238 31221  0 Mar28 ?        00:57:40 [java] <defunct>
root     31284 31238  0 Mar28 ?        00:00:00 /usr/sbin/sshd

$ docker-runc  exec -t 17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b ps
exec failed: container_linux.go:240: creating new parent process caused "container_linux.go:1254: running lstat on namespace path \"/proc/30849/ns/ipc\" caused \"lstat /proc/30849/ns/ipc: no such file or directory\""

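Investigating the namespace

When exec tries to enter the container, it reports that the namespace file does not exist (lstat /proc/30849/ns/ipc), and process 30849 itself no longer exists on the node.

Find the process still alive in the container: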

$ docker-runc  ps  17619dcf545c0936b5a5bad416c1fe50064547f49ba3f34721baf4201f242c1b
UID        PID  PPID  C STIME TTY          TIME CMD
root     31238     1  0 Mar28 ?        00:57:40 [java] <defunct>

That process has become a zombie:

$ ps aux|grep Zsl
root     17258  0.0  0.0 112636  2060 pts/0    R+   14:19   0:00 grep --color=auto Zsl
root     31238  0.3  0.0      0     0 ?        Zsl  Mar28  57:40 [java] <defunct>
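
Grepping for the literal state string Zsl also matches the grep command itself and misses zombies with other state flags. As an alternative sketch, the state column can be tested directly. It assumes the input comes from ps -eo pid,ppid,stat,comm, so the third column is the process state; the function name list_zombies is illustrative.

```shell
#!/usr/bin/env bash
# Read `ps -eo pid,ppid,stat,comm` output on stdin, skip the header line,
# and print "pid ppid command" for every process whose state starts with Z.
list_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Example:
#   ps -eo pid,ppid,stat,comm | list_zombies
```

Printing the parent PID alongside each zombie shows which process has not yet reaped it.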

Its namespaces are incomplete as well:

$ ls /proc/31238/ns
ls: cannot read symbolic link net: No such file or directory
ls: cannot read symbolic link uts: No such file or directory
ls: cannot read symbolic link ipc: No such file or directory
ls: cannot read symbolic link mnt: No such file or directory
total 0
lrwxrwxrwx 1 root root 0 Apr  8 14:21 ipc
lrwxrwxrwx 1 root root 0 Mar 28 18:42 mnt
lrwxrwxrwx 1 root root 0 Apr  8 14:21 net
lrwxrwxrwx 1 root root 0 Mar 28 18:42 pid -> pid:[4026532484]
lrwxrwxrwx 1 root root 0 Apr  8 14:21 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Mar 28 18:42 uts
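
The same check can be scripted so that it is easy to repeat for other suspect processes. This is a minimal sketch that reports every entry under /proc/PID/ns whose link target cannot be read, which is how the broken entries show up in the listing above. The function name broken_ns_links and its optional second argument are illustrative, and reading another process's namespace links normally requires root.

```shell
#!/usr/bin/env bash
# Print the name of every namespace entry of a process whose link target
# cannot be read. $1 is the PID; $2 is the proc root and defaults to /proc.
broken_ns_links() {
    local pid="$1"
    local proc_root="${2:-/proc}"
    local entry
    for entry in "$proc_root/$pid/ns"/*; do
        # Skip the unexpanded glob when the directory is missing or empty.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        # readlink fails when the target cannot be resolved.
        readlink "$entry" >/dev/null 2>&1 || basename "$entry"
    done
}

# Example (as root on the node):
#   broken_ns_links 31238
```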

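At this point it turned out that the pause container belonging to this container had been deleted long ago.

Finally, as described at the beginning of this post, once glusterfs was unmounted these containers were deleted successfully, and once the glusterfs process was killed docker ps responded again.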



Copyright @2011-2019 All rights reserved. Please include a link to the original post when reposting.
