通过 pid 查找 container_ID

场景

我的问题是,节点负载很高(cpu/mem),一时半会定位不到问题所在,且监控层面看到文件句柄数很高18w+,因此通过 pidstat 查看下进程利用率(内存相关),因为之前出现过主机内存 90+% 的利用率居高不下

命令

# 查找占用 40g 以上的虚拟内存的 pid
# pidstat -rsl 1 3 | awk 'NR>1{if($7>40000000) print}'

# pidstat -rsl 1 3 | awk 'NR>1{if($7>10000) print}'
# output
11时42分28秒   UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM StkSize  StkRef  Command
11时42分30秒     0   2738052      1.54      0.00 1254560 484040   3.99    136      24  kube-apiserver --advertise-address=10.29.15.79 --allow-privileged=true --anonymous-auth=True --apiserver-count=1 --audit-log-ma
平均时间:   UID       PID  minflt/s  majflt/s     VSZ    RSS   %MEM StkSize  StkRef  Command
平均时间:     0   2738052      1.54      0.00 1254560 484040   3.99    136      24  kube-apiserver --advertise-address=10.29.15.79 --allow-privileged=true --anonymous-auth=True --apiserver-count=1 --audit-log-ma

# lsof -p $PID | wc -l  # 可以使用 lsof 来统计该 pid 的文件句柄数集信息,我的环境是打开了 8w+,kill 掉其主进程后内存和 cpu利用有所降低,之所以有 8w+,是因为其中一部分文件删除后没有自动释放掉导致

# 通过 pid 查找其是否有 ppid
# ps -elF | grep 2738052 | grep -v grep 
# output
4 S root     2738052 2737683 18  80   0 - 313640 futex_ 451296 5 1月02 ?       15:09:26 kube-apiserver --advertise-address=10.29.15.79 --allow-privileged=true --anonymous-auth=True --apiserver-count=1 --audit-log-maxage=30 --audit-log-maxbackup=1 --audit-log-maxsize=100 --audit-log-path=/var/log/audit/kube-apiserver-audit.log --audit-policy-file=/etc/kubernetes/audit-policy/apiserver-audit-policy.yaml --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --client-ca-file=/etc/kubernetes/ssl/ca.crt --default-not-ready-toleration-seconds=300 --default-unreachable-toleration-seconds=300 --enable-admission-plugins=NodeRestriction --enable-aggregator-routing=False --enable-bootstrap-token-auth=true --endpoint-reconciler-type=lease --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt --etcd-certfile=/etc/kubernetes/ssl/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/ssl/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --event-ttl=1h0m0s --kubelet-client-certificate=/etc/kubernetes/ssl/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/ssl/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalDNS,InternalIP,Hostname,ExternalDNS,ExternalIP --profiling=False --proxy-client-cert-file=/etc/kubernetes/ssl/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/ssl/front-proxy-client.key --request-timeout=1m0s --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/etc/kubernetes/ssl/sa.pub --service-account-lookup=True --service-account-signing-key-file=/etc/kubernetes/ssl/sa.key --service-cluster-ip-range=10.233.0.0/18 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/etc/kubernetes/ssl/apiserver.crt --tls-private-key-file=/etc/kubernetes/ssl/apiserver.key

# ps -elF | grep 2737683 | grep -v grep 
# output
0 S root     2737683       1  0  80   0 - 178160 futex_ 13124  5 1月02 ?       00:03:22 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id fd272ecbe1bf56fb5fbff41130437e5c4b5b444dab2e315a19ed846dc1eacd42 -address /run/containerd/containerd.sock
4 S 65535    2737744 2737683  0  80   0 -   245 sys_pa     4   2 1月02 ?       00:00:00 /pause
4 S root     2738052 2737683 18  80   0 - 313640 futex_ 487444 5 1月02 ?       15:09:52 kube-apiserver --advertise-address=10.29.15.79 --allow-privileged=true --anonymous-auth=True --apiserver-count=1 --audit-log-maxage=30 --audit-log-maxbackup=1 --audit-log-maxsize=100 --audit-log-path=/var/log/audit/kube-apiserver-audit.log --audit-policy-file=/etc/kubernetes/audit-policy/apiserver-audit-policy.yaml --authorization-mode=Node,RBAC --bind-address=0.0.0.0 --client-ca-file=/etc/kubernetes/ssl/ca.crt --default-not-ready-toleration-seconds=300 --default-unreachable-toleration-seconds=300 --enable-admission-plugins=NodeRestriction --enable-aggregator-routing=False --enable-bootstrap-token-auth=true --endpoint-reconciler-type=lease --etcd-cafile=/etc/kubernetes/ssl/etcd/ca.crt --etcd-certfile=/etc/kubernetes/ssl/apiserver-etcd-client.crt --etcd-keyfile=/etc/kubernetes/ssl/apiserver-etcd-client.key --etcd-servers=https://127.0.0.1:2379 --event-ttl=1h0m0s --kubelet-client-certificate=/etc/kubernetes/ssl/apiserver-kubelet-client.crt --kubelet-client-key=/etc/kubernetes/ssl/apiserver-kubelet-client.key --kubelet-preferred-address-types=InternalDNS,InternalIP,Hostname,ExternalDNS,ExternalIP --profiling=False --proxy-client-cert-file=/etc/kubernetes/ssl/front-proxy-client.crt --proxy-client-key-file=/etc/kubernetes/ssl/front-proxy-client.key --request-timeout=1m0s --requestheader-allowed-names=front-proxy-client --requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.crt --requestheader-extra-headers-prefix=X-Remote-Extra- --requestheader-group-headers=X-Remote-Group --requestheader-username-headers=X-Remote-User --secure-port=6443 --service-account-issuer=https://kubernetes.default.svc.cluster.local --service-account-key-file=/etc/kubernetes/ssl/sa.pub --service-account-lookup=True --service-account-signing-key-file=/etc/kubernetes/ssl/sa.key --service-cluster-ip-range=10.233.0.0/18 --service-node-port-range=30000-32767 --storage-backend=etcd3 --tls-cert-file=/etc/kubernetes/ssl/apiserver.crt --tls-private-key-file=/etc/kubernetes/ssl/apiserver.key

# ps -ejH | grep -B 10 2738052  # 打印进程树
# output
2737683 2737683    1163 ?        00:03:24   containerd-shim
2737744 2737744 2737744 ?        00:00:00     pause
2738052 2738052 2738052 ?        15:17:02     kube-apiserver

# 结论
pid 2738052 是由 ppid 2737683 启动,而 2737683 是 containerd-shim 的 pid,因此 2738052 是 containerd-shim 的线程,也就是由它启动的 container

确认容器

# nerdctl ps | awk 'NR>1{print $1}' | while read line;do echo -n "ContainerID: " $line;  echo " Pid: " `nerdctl inspect $line --format="{{.State.Pid}}"`;done

# output
ContainerID:  001c06d37aa3 Pid:  3894654
ContainerID:  10948947ffe2 Pid:  3190
ContainerID:  12e473bf2092 Pid:  2738736
ContainerID:  2036c51b2eb1 Pid:  2638619
ContainerID:  31f6916a4d19 Pid:  2144173
ContainerID:  3e7aaa92e14d Pid:  946163
ContainerID:  40290d1314b0 Pid:  2737814
ContainerID:  41ef80123273 Pid:  2738052
ContainerID:  422a631fa706 Pid:  4151
ContainerID:  4bf701e368e1 Pid:  2737820
ContainerID:  535448e5cdd0 Pid:  2638612
ContainerID:  5ea9dbdedc25 Pid:  4711
ContainerID:  5f17a1c82e2a Pid:  2565
ContainerID:  68ddb09e71ba Pid:  4962
ContainerID:  69cf228ae17d Pid:  4054
ContainerID:  74d39157ea3c Pid:  2738680
ContainerID:  77cf2b283806 Pid:  5056
ContainerID:  7acb0441a3ad Pid:  2737689
ContainerID:  821dcae2eaca Pid:  1023468
ContainerID:  88060b9b98da Pid:  2448
ContainerID:  9179c60702c4 Pid:  2647586
ContainerID:  91ea0bcd6c1f Pid:  2737993
ContainerID:  95b54dbb14d0 Pid:  2286
ContainerID:  9cc402cb9ca2 Pid:  2716
ContainerID:  9dab08f90039 Pid:  946124
ContainerID:  ade4806bb944 Pid:  2518
ContainerID:  aee2b96ee5d9 Pid:  2544
ContainerID:  b3519e63725c Pid:  2737695
ContainerID:  b5bd53932069 Pid:  2295
ContainerID:  b816d668b027 Pid:  3962
ContainerID:  c8198ad4cc28 Pid:  2543
ContainerID:  d27c2a0c9a22 Pid:  1021592
ContainerID:  e9bdb4ad0fd6 Pid:  2350
ContainerID:  f41fa9aee395 Pid:  2737656
ContainerID:  f56f55af0af3 Pid:  4065
ContainerID:  fd272ecbe1bf Pid:  2737744

# grep 2738052 
# outout
ContainerID:  41ef80123273 Pid:  2738052

# nerdctl ps | grep 41ef80123273
# output
41ef80123273    k8s-gcr.m.daocloud.io/kube-apiserver:v1.24.7                             "kube-apiserver --ad…"    3 days ago      Up                 k8s://kube-system/kube-apiserver-master01/kube-apiserver

# kubectl get pod -o wide --all-namespaces  | grep kube-apiserver-master01
# output
kube-system           kube-apiserver-master01                                      1/1     Running            11 (3d9h ago)    67d   10.29.15.79      master01   <none>           <none>

总结

  • pidstat -rsl 1 3 | awk ‘NR>1{if($7>40000000) print}’

  -r 查看 pid 占用的内存信息,虚拟内存/实际使用内存/占总内存的百分比
  -s 查看 pid 占用的内存堆栈
  -l 显示详细的 command
  • ps -elF | grep $PID

  -e 所有进程
  -L 显示线程,带有 LWP 和 NLWP 列,结合 -f 使用
  -l 通常是输出格式的控制
  -F 按指定格式打印进程/线程,信息更完整,结合 -el
  
  -elF 可以找到其线程的 ppid,我的环境中一次就定位到 pid,说明 pod 没有其它的主进程或者更多的线程,而像业务 pod,一般都是用 shell 脚本拉起主进程,而主进程再启动更多的线程,如果要定为就是多次使用 ps -elF 进行 ppid 的查询,并最后通过 ppid 来找到该线程/进程是哪个容器和 pod 产生的 
  
  -efL 可以精确的统计每个进程的线程数
  
  -ejH # 打印进程树、更直观