Caveats for Bulk Deletion of Kubernetes Resource Objects: Why Deletion Order and Monitoring Deletion Status Matter

1. Problem Description

An internal platform at our company is built on top of Kubernetes (the storage layer uses dynamic volume provisioning backed by NFS). The platform has a notion of a logical unit that can be created through an API call. At the Kubernetes level, this logical unit maps to a collection of closely related resource objects: namespaces, deployments, statefulsets, jobs, persistentvolumeclaims, persistentvolumes, and so on. You can think of the unit as a complex application running on Kubernetes, one that also spans multiple namespaces.

When colleagues develop, debug, and operate this logical unit day to day, they usually do the following:

  1. When developing and debugging the unit's cleanup API, they call the Kubernetes API directly to delete the unit's various resource objects. The deletion is a straight delete: no ordering is enforced, and the deletion status of the individual resource objects is never monitored;
  2. When operating the unit, to save effort they simply run kubectl to delete the namespaces associated with it.

Recently we noticed that in the production environment, Pods frequently failed to delete and stayed stuck in the Terminating state. I had occasionally run into this myself, and colleagues kept reporting it to me as well. So I dug into it: on the host node where a stuck Pod lived, I checked the kubelet logs and found errors like the following:

......
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.324703 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/9c7bb592-e883-11e8-aae0-0050568a2af3-pvc-9c75962b-e883-11e8-aae0-0050568a2af3\" (\"9c7bb592-e883-11e8-aae0-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.324623633 +0800 CST m=+15413.721922139 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-9c75962b-e883-11e8-aae0-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/9c7bb592-e883-11e8-aae0-0050568a2af3-pvc-9c75962b-e883-11e8-aae0-0050568a2af3\") pod \"9c7bb592-e883-11e8-aae0-0050568a2af3\" (UID: \"9c7bb592-e883-11e8-aae0-0050568a2af3\") : error reading /var/lib/kubelet/pods/9c7bb592-e883-11e8-aae0-0050568a2af3/volume-subpaths/pvc-9c75962b-e883-11e8-aae0-0050568a2af3/orderer1: lstat /var/lib/kubelet/pods/9c7bb592-e883-11e8-aae0-0050568a2af3/volume-subpaths/pvc-9c75962b-e883-11e8-aae0-0050568a2af3/orderer1/0: stale NFS file handle"
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.324809 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/b46aa346-e71d-11e8-b49a-0050568a2af3-pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3\" (\"b46aa346-e71d-11e8-b49a-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.324727477 +0800 CST m=+15413.722025987 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/b46aa346-e71d-11e8-b49a-0050568a2af3-pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3\") pod \"b46aa346-e71d-11e8-b49a-0050568a2af3\" (UID: \"b46aa346-e71d-11e8-b49a-0050568a2af3\") : error reading /var/lib/kubelet/pods/b46aa346-e71d-11e8-b49a-0050568a2af3/volume-subpaths/pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3/orderer1: lstat /var/lib/kubelet/pods/b46aa346-e71d-11e8-b49a-0050568a2af3/volume-subpaths/pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3/orderer1/0: stale NFS file handle"
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.325743 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/9b2a49c7-e726-11e8-a7a5-0050568a2af3-pvc-9b07268f-e726-11e8-a7a5-0050568a2af3\" (\"9b2a49c7-e726-11e8-a7a5-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.325669009 +0800 CST m=+15413.722967535 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-9b07268f-e726-11e8-a7a5-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/9b2a49c7-e726-11e8-a7a5-0050568a2af3-pvc-9b07268f-e726-11e8-a7a5-0050568a2af3\") pod \"9b2a49c7-e726-11e8-a7a5-0050568a2af3\" (UID: \"9b2a49c7-e726-11e8-a7a5-0050568a2af3\") : error reading /var/lib/kubelet/pods/9b2a49c7-e726-11e8-a7a5-0050568a2af3/volume-subpaths/pvc-9b07268f-e726-11e8-a7a5-0050568a2af3/cli: lstat /var/lib/kubelet/pods/9b2a49c7-e726-11e8-a7a5-0050568a2af3/volume-subpaths/pvc-9b07268f-e726-11e8-a7a5-0050568a2af3/cli/1: stale NFS file handle"
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.325843 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/dc5262e0-e710-11e8-989d-0050568a2af3-pvc-66a1068f-e22d-11e8-a331-0050568a2af3\" (\"dc5262e0-e710-11e8-989d-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.325772394 +0800 CST m=+15413.723070969 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-66a1068f-e22d-11e8-a331-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/dc5262e0-e710-11e8-989d-0050568a2af3-pvc-66a1068f-e22d-11e8-a331-0050568a2af3\") pod \"dc5262e0-e710-11e8-989d-0050568a2af3\" (UID: \"dc5262e0-e710-11e8-989d-0050568a2af3\") : error reading /var/lib/kubelet/pods/dc5262e0-e710-11e8-989d-0050568a2af3/volume-subpaths/pvc-66a1068f-e22d-11e8-a331-0050568a2af3/org1-1000: lstat /var/lib/kubelet/pods/dc5262e0-e710-11e8-989d-0050568a2af3/volume-subpaths/pvc-66a1068f-e22d-11e8-a331-0050568a2af3/org1-1000/1: stale NFS file handle"
......

2. Diagnosis

At first this log meant little to me. I Googled it, and some community threads described it as a Kubernetes bug. Even so, one idea kept lingering in my mind: the resource objects themselves have an implicit dependency on one another. Put simply, for a given containerized application, the pods created by a deployment depend on a particular persistentvolumeclaim. If, when deleting the application, the persistentvolumeclaim is removed first and the deployment that depends on it is deleted afterwards, Kubernetes can hit exactly the problem above while reclaiming resources, because the dependency is already gone. After some further troubleshooting, this turned out to be precisely the root cause.
The root cause can therefore be summed up in one sentence: when Kubernetes actually performs the deletion of a resource object, it finds that the object depends on another resource object and that the binding between the two must be undone. Unfortunately, the object it depends on has already been deleted. Undoing the binding may require reclaiming the underlying resource, which no longer exists, so the reclamation keeps failing and the target resource object can never be deleted successfully.

To make this dependency easier to understand, here is the corresponding YAML. Assuming the nginx pod takes a while to terminate, deleting these objects without controlling the order reproduces the same situation:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.15.3
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP
        volumeMounts:
        - mountPath: /data/
          name: data
      restartPolicy: Always
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: nginx-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1G
  storageClassName: managed-nfs-storage

3. Summary

  1. Taking the YAML in section 2 as an example: when deleting, you must delete the deployment first, wait until the deployment and all of its pod replicas are confirmed gone, and only then delete the persistentvolumeclaim. If both deletions are submitted to Kubernetes at the same time and the deployment's pods take a long time to terminate, the persistentvolumeclaim finishes deleting before the deployment does, and the problem described above appears.
  2. For deployments and jobs, checking the deletion status of the object itself appears to be sufficient: in my tests, once the object itself was gone, its pod replicas were gone as well. For a statefulset this is not the case: you must check the deletion status of the pod replicas themselves. In my tests, the statefulset object was already deleted while its pods were still terminating and only disappeared a little later. The pods' deletion status therefore has to be checked separately, which is something to be especially careful about when writing code against the API (see the sketch after this list).
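
Below is a minimal sketch of that ordering, written in the same pre-context client-go style as the code in section 4. The helper name deletePVCAfterPodsGone, the 2-second/2-minute polling parameters, and the label selector and claim name used in the usage note are illustrative assumptions, not part of the platform's actual code: after the workload has been deleted, it polls until no matching pods remain and only then deletes the persistentvolumeclaim.

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// deletePVCAfterPodsGone is a hypothetical helper: it waits until no pods matching
// labelSelector remain in the namespace, then deletes the PersistentVolumeClaim.
// It assumes the owning deployment/statefulset has already been deleted.
func deletePVCAfterPodsGone(clientset *kubernetes.Clientset, namespace, labelSelector, pvcName string) error {
	// Poll every 2 seconds, for up to 2 minutes, until the pod list for this workload is empty.
	err := wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		pods, err := clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: labelSelector})
		if err != nil {
			return false, err
		}
		return len(pods.Items) == 0, nil
	})
	if err != nil {
		return err
	}
	// Only after the pods are confirmed gone is it safe to delete the PVC they were mounting.
	return clientset.CoreV1().PersistentVolumeClaims(namespace).Delete(pvcName, &metav1.DeleteOptions{})
}

For the nginx example above, this could be called as deletePVCAfterPodsGone(clientset, "default", "app=nginx", "nginx-pvc") once the deployment deletion has been issued.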

4. Code Example: Deleting a Resource Object (Deployment) While Monitoring Its Deletion

import (
	"errors"
	"time"

	"github.com/golang/glog"
	v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	pkglabels "k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// Metadata identifies a resource object by kind, name, and namespace.
type Metadata struct {
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

// ResourceKindDeployment is the kind string used to pick deployments out of the list.
const ResourceKindDeployment = "Deployment"

// deleteDeploymentsAndWait deletes every Deployment listed in metadataList and, for each
// one, blocks until a watch reports the Deleted event (or a 30-second timeout is hit).
func deleteDeploymentsAndWait(clientset *kubernetes.Clientset, metadataList []Metadata) error {
	deletePropagationBackground := v1.DeletePropagationBackground

	// Collect the deployments from the full list of resource objects.
	metadataDepList := make([]Metadata, 0)
	for i := 0; i < len(metadataList); i++ {
		metadata := metadataList[i]
		glog.Info("metadata: ", metadata)
		if metadata.Kind == ResourceKindDeployment {
			metadataDepList = append(metadataDepList, metadata)
		}
	}

	for i := 0; i < len(metadataDepList); i++ {
		metadataDep := metadataDepList[i]
		dep, err := clientset.AppsV1beta1().Deployments(metadataDep.Namespace).Get(metadataDep.Name, v1.GetOptions{})
		if err != nil {
			return err
		}
		labels := dep.Labels

		// Start watching deployments with the same labels BEFORE issuing the delete,
		// so the Deleted event cannot slip past before the watch is established.
		selector := pkglabels.FormatLabels(labels)
		depWatch, err := clientset.AppsV1beta1().Deployments(metadataDep.Namespace).Watch(v1.ListOptions{
			LabelSelector: selector,
		})
		if err != nil {
			return err
		}

		sign := make(chan error, 1)
		go func() {
			defer depWatch.Stop()
			for {
				select {
				case data, ok := <-depWatch.ResultChan():
					if !ok {
						sign <- errors.New("deployment watch channel closed unexpectedly")
						return
					}
					glog.Infof("deployment.Name: %s, deployment.Namespace: %s, event: %s", dep.Name, dep.Namespace, data.Type)
					// Only a Deleted event means the deployment is really gone;
					// Added/Modified events are ignored.
					if data.Type == watch.Deleted {
						glog.Infof("Deleted deployment.Name: %s, deployment.Namespace: %s.", dep.Name, dep.Namespace)
						sign <- nil
						return
					}
				case <-time.After(30 * time.Second):
					// No event within 30 seconds: give up on this deployment.
					sign <- errors.New("delete deployment timeout")
					return
				}
			}
		}()

		// Issue the delete, then block until the watch confirms the Deleted event.
		err = clientset.AppsV1beta1().Deployments(metadataDep.Namespace).Delete(metadataDep.Name, &v1.DeleteOptions{
			PropagationPolicy: &deletePropagationBackground,
		})
		if err != nil {
			return err
		}
		if err = <-sign; err != nil {
			return err
		}
	}
	return nil
}
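
Two details of this example are worth calling out. First, the watch is established before the Delete call is issued; if it were started afterwards, the Deleted event could fire before the watch exists and the wait would only return at the timeout. Second, as noted in section 3, for a statefulset the same watch-and-wait treatment must additionally be applied to its pods (for example with the helper sketched after the summary), because the statefulset object itself can disappear while its pods are still terminating.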