How to Protect System-Level Pods from Being Evicted by the Kubelet

I. Environment and Versions

Docker 17.03.1-ce
Kubeadm v1.11.0
Kubelet v1.11.0
Kubectl v1.11.0
Calico v3.1.3

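II. Problem Description

In a Kubernetes cluster built with Kubeadm, the Kubernetes control plane starts to fail when resources run short, and the cluster becomes hard to use. For example, push disk usage on the server hosting the control plane above 90% and the problem reproduces immediately.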

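III. Locating the Problem

Investigation shows that when resources are tight, the Kubelet evicts Pods on the affected Node. That includes the Pods of the Kubernetes control plane: they are killed and their images are deleted. The Kubelet does this to free resources, but wiping out the control plane Pods is not what users want. On a worker Node, the same behavior kills essential system Pods and leaves the Node barely usable. Across the whole cluster, it means that control plane Pods, essential system-level Pods on the Nodes, and important Addon Pods can all be killed. At scale, the severity of the problem is easy to imagine.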
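To make the trigger concrete, here is a minimal sketch of the kind of check behind a disk-based hard eviction threshold. It assumes the kubelet's documented default of nodefs.available<10%, which is why 90% disk usage is enough to reproduce the problem; the helper name and the sample sizes are hypothetical, and the real eviction manager evaluates several signals, not just this one.

```go
package main

import "fmt"

// hardEvictionMinAvailable is the share of the node filesystem that must stay
// free. 0.10 mirrors the documented default threshold nodefs.available<10%.
const hardEvictionMinAvailable = 0.10

// underDiskPressure reports whether the free share of the filesystem has
// dropped below minAvailable. Hypothetical helper, for illustration only.
func underDiskPressure(capacityBytes, availableBytes uint64, minAvailable float64) bool {
	if capacityBytes == 0 {
		return false
	}
	return float64(availableBytes)/float64(capacityBytes) < minAvailable
}

func main() {
	const gib = uint64(1) << 30
	capacity := 100 * gib
	for _, used := range []uint64{80 * gib, 89 * gib, 91 * gib} {
		available := capacity - used
		fmt.Printf("used=%dGiB pressure=%v\n", used/gib,
			underDiskPressure(capacity, available, hardEvictionMinAvailable))
	}
}
```

With a 100 GiB disk, only the 91 GiB case falls below the threshold; from that point on the Kubelet starts looking for Pods to evict.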

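IV. Problem Analysis

Some Kubernetes users will say that Kubeadm-built clusters are not recommended for production, so if you insist on using one, of course you will run into problems. That reaction is not surprising. For a long time Kubeadm-built clusters were not recommended for production: first the official documentation advised against it, and most users followed suit. As Kubernetes evolved, though, the warning against production use disappeared. The official documentation now devotes a lot of space to the Kubeadm installation method and even describes a high-availability setup for it, while other installation methods get a passing mention or none at all. What does that tell us? That the project recommends and encourages users to deploy Kubernetes clusters this way.

Can a cluster built this way really be used in production? If not, what stands in the way? The author has experimented extensively in day-to-day work and now runs such a cluster in production. In practice Kubeadm is already quite mature. It has few problems, though not zero, and because its code is cleanly structured, the ones that exist are easy to fix. The version the author runs in production carries some modifications.

Here the author lists only two major problems with Kubeadm in real-world use, and this article describes the solution to the fatal one in detail.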

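  1. Time zones. Time zones have always looked like a simple problem that nobody takes seriously. For users in the United States in particular, the default UTC time zone is good enough. But what about users in other time zones, such as those of us in China? The fix is to modify the Kubeadm source and add the time zone settings used by the community to the YAML of the control plane Pods; see "Docker容器的时区设置" (Time Zone Settings for Docker Containers).
  2. Eviction. When a node runs short of resources, the Kubelet starts evicting Pods and deleting their images to free resources. That sounds harmless, but in practice it is a fatal problem. Why? With a Kubeadm installation, static Pods such as kube-apiserver, kube-controller-manager, and kube-scheduler are fine, because upstream provides a way to keep them from being evicted. System-level Pods installed as Addons, however, have no protection at all. Reproducing the problem is easy: once disk usage goes above 90%, the cluster immediately starts cycling between killing itself and recovering. Can it be solved? Certainly, but the community has not solved it yet; the author looked through the related GitHub issue (see the references). To run this in production, we have to fix it ourselves. How? The Kubelet causes the problem, so the Kubelet source has to be modified.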

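V. Solution (based on Kubernetes v1.11.0; later versions will differ)

The following example shows how to read the listings: the code the author adds or modifies sits between two "..." lines.
example.go

...
code to be added or modified goes here
...

Now for the actual source changes: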

1. kube_features.go

// defaultKubernetesFeatureGates consists of all known Kubernetes-specific feature keys.
// To add a new feature, define a key for it above and add it here. The features will be
// available throughout Kubernetes binaries.
var defaultKubernetesFeatureGates = map[utilfeature.Feature]utilfeature.FeatureSpec{
	AppArmor: {Default: true, PreRelease: utilfeature.Beta},
	DynamicKubeletConfig: {Default: true, PreRelease: utilfeature.Beta},
	ExperimentalHostUserNamespaceDefaultingGate: {Default: false, PreRelease: utilfeature.Beta},
	ExperimentalCriticalPodAnnotation: {Default: false, PreRelease: utilfeature.Alpha},
	...
	ExperimentalNotEvictedPodAnnotation: {Default: false, PreRelease: utilfeature.Alpha},
	...
	DevicePlugins: {Default: true, PreRelease: utilfeature.Beta},
	TaintBasedEvictions: {Default: false, PreRelease: utilfeature.Alpha},
	RotateKubeletServerCertificate: {Default: false, PreRelease: utilfeature.Alpha},
	RotateKubeletClientCertificate: {Default: true, PreRelease: utilfeature.Beta},
	PersistentLocalVolumes: {Default: true, PreRelease: utilfeature.Beta},
	LocalStorageCapacityIsolation: {Default: true, PreRelease: utilfeature.Beta},
	HugePages: {Default: true, PreRelease: utilfeature.Beta},
	Sysctls: {Default: true, PreRelease: utilfeature.Beta},
	DebugContainers: {Default: false, PreRelease: utilfeature.Alpha},
	PodShareProcessNamespace: {Default: false, PreRelease: utilfeature.Alpha},
	PodPriority: {Default: true, PreRelease: utilfeature.Beta},
	EnableEquivalenceClassCache: {Default: false, PreRelease: utilfeature.Alpha},
	TaintNodesByCondition: {Default: false, PreRelease: utilfeature.Alpha},
	MountPropagation: {Default: true, PreRelease: utilfeature.Beta},
	QOSReserved: {Default: false, PreRelease: utilfeature.Alpha},
	ExpandPersistentVolumes: {Default: true, PreRelease: utilfeature.Beta},
	ExpandInUsePersistentVolumes: {Default: false, PreRelease: utilfeature.Alpha},
	AttachVolumeLimit: {Default: false, PreRelease: utilfeature.Alpha},
	CPUManager: {Default: true, PreRelease: utilfeature.Beta},
	ServiceNodeExclusion: {Default: false, PreRelease: utilfeature.Alpha},
	MountContainers: {Default: false, PreRelease: utilfeature.Alpha},
	VolumeScheduling: {Default: true, PreRelease: utilfeature.Beta},
	CSIPersistentVolume: {Default: true, PreRelease: utilfeature.Beta},
	CustomPodDNS: {Default: true, PreRelease: utilfeature.Beta},
	BlockVolume: {Default: false, PreRelease: utilfeature.Alpha},
	StorageObjectInUseProtection: {Default: true, PreRelease: utilfeature.GA},
	ResourceLimitsPriorityFunction: {Default: false, PreRelease: utilfeature.Alpha},
	SupportIPVSProxyMode: {Default: true, PreRelease: utilfeature.GA},
	SupportPodPidsLimit: {Default: false, PreRelease: utilfeature.Alpha},
	HyperVContainer: {Default: false, PreRelease: utilfeature.Alpha},
	ScheduleDaemonSetPods: {Default: false, PreRelease: utilfeature.Alpha},
	TokenRequest: {Default: false, PreRelease: utilfeature.Alpha},
	TokenRequestProjection: {Default: false, PreRelease: utilfeature.Alpha},
	CRIContainerLogRotation: {Default: true, PreRelease: utilfeature.Beta},
	GCERegionalPersistentDisk: {Default: true, PreRelease: utilfeature.Beta},
	RunAsGroup: {Default: false, PreRelease: utilfeature.Alpha},
	VolumeSubpath: {Default: true, PreRelease: utilfeature.GA},
	BalanceAttachedNodeVolumes: {Default: false, PreRelease: utilfeature.Alpha},
	DynamicProvisioningScheduling: {Default: false, PreRelease: utilfeature.Alpha},
	PodReadinessGates: {Default: false, PreRelease: utilfeature.Beta},
	VolumeSubpathEnvExpansion: {Default: false, PreRelease: utilfeature.Alpha},
	KubeletPluginsWatcher: {Default: false, PreRelease: utilfeature.Alpha},
	ResourceQuotaScopeSelectors: {Default: false, PreRelease: utilfeature.Alpha},
	CSIBlockVolume: {Default: false, PreRelease: utilfeature.Alpha},

	// inherited features from generic apiserver, relisted here to get a conflict if it is changed
	// unintentionally on either side:
	genericfeatures.StreamingProxyRedirects: {Default: true, PreRelease: utilfeature.Beta},
	genericfeatures.AdvancedAuditing: {Default: true, PreRelease: utilfeature.Beta},
	genericfeatures.APIResponseCompression: {Default: false, PreRelease: utilfeature.Alpha},
	genericfeatures.Initializers: {Default: false, PreRelease: utilfeature.Alpha},
	genericfeatures.APIListChunking: {Default: true, PreRelease: utilfeature.Beta},

	// inherited features from apiextensions-apiserver, relisted here to get a conflict if it is changed
	// unintentionally on either side:
	apiextensionsfeatures.CustomResourceValidation: {Default: true, PreRelease: utilfeature.Beta},
	apiextensionsfeatures.CustomResourceSubresources: {Default: true, PreRelease: utilfeature.Beta},

	// features that enable backwards compatibility but are scheduled to be removed
	ServiceProxyAllowExternalIPs: {Default: false, PreRelease: utilfeature.Deprecated},
	ReadOnlyAPIDataVolumes: {Default: true, PreRelease: utilfeature.Deprecated},
}


// owner: @vishh
// alpha: v1.5
//
// Ensures guaranteed scheduling of pods marked with a special pod annotation `scheduler.alpha.kubernetes.io/critical-pod`
// and also prevents them from being evicted from a node.
// Note: This feature is not supported for `BestEffort` pods.
ExperimentalCriticalPodAnnotation utilfeature.Feature = "ExperimentalCriticalPodAnnotation"
...
// owner: @singhwang
// alpha: custom patch on v1.11.0 (not part of upstream Kubernetes)
//
// Prevents pods in `kube-system` marked with the special pod annotation
// `scheduler.alpha.kubernetes.io/not-evicted-pod` from being evicted from a node.
ExperimentalNotEvictedPodAnnotation utilfeature.Feature = "ExperimentalNotEvictedPodAnnotation"
...
// owner: @jiayingz
// beta: v1.10
//
// Enables support for Device Plugins
DevicePlugins utilfeature.Feature = "DevicePlugins"
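Because the new gate is registered with Default: false, the patch changes nothing until the gate is switched on explicitly. The following is a simplified sketch of that behavior, not the real utilfeature implementation: applyFeatureGates and the defaults map are hypothetical stand-ins that copy the two defaults from the map above and overlay a comma-separated key=value string of the kind passed to --feature-gates.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaults mirrors the two "Default: false" entries from the feature gate map.
var defaults = map[string]bool{
	"ExperimentalCriticalPodAnnotation":   false,
	"ExperimentalNotEvictedPodAnnotation": false,
}

// applyFeatureGates returns a copy of defaults with the overrides from a
// "Key=bool,Key=bool" string applied. Hypothetical helper, for illustration.
func applyFeatureGates(defaults map[string]bool, flag string) (map[string]bool, error) {
	gates := make(map[string]bool, len(defaults))
	for name, enabled := range defaults {
		gates[name] = enabled
	}
	if strings.TrimSpace(flag) == "" {
		return gates, nil
	}
	for _, pair := range strings.Split(flag, ",") {
		kv := strings.SplitN(strings.TrimSpace(pair), "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed feature gate %q", pair)
		}
		if _, known := gates[kv[0]]; !known {
			return nil, fmt.Errorf("unknown feature gate %q", kv[0])
		}
		enabled, err := strconv.ParseBool(kv[1])
		if err != nil {
			return nil, fmt.Errorf("invalid value for %q: %v", kv[0], err)
		}
		gates[kv[0]] = enabled
	}
	return gates, nil
}

func main() {
	flag := "ExperimentalCriticalPodAnnotation=true,ExperimentalNotEvictedPodAnnotation=true"
	gates, err := applyFeatureGates(defaults, flag)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("not-evicted gate enabled:", gates["ExperimentalNotEvictedPodAnnotation"])
}
```

With an empty flag string both gates stay off, which is the situation on an unmodified node.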

2. pod_update.go

const (
	ConfigSourceAnnotationKey = "kubernetes.io/config.source"
	ConfigMirrorAnnotationKey = v1.MirrorPodAnnotationKey
	ConfigFirstSeenAnnotationKey = "kubernetes.io/config.seen"
	ConfigHashAnnotationKey = "kubernetes.io/config.hash"
	CriticalPodAnnotationKey = "scheduler.alpha.kubernetes.io/critical-pod"
	...
	NotEvictedPodAnnotationKey = "scheduler.alpha.kubernetes.io/not-evicted-pod"
	...
)

// IsCriticalPodBasedOnPriority checks if the given pod is a critical pod based on priority resolved from pod Spec.
func IsCriticalPodBasedOnPriority(priority int32) bool {
	if priority >= scheduling.SystemCriticalPriority {
		return true
	}
	return false
}
...
// IsNotEvictedPod returns true if the pod bears the not evicted pod annotation key.
func IsNotEvictedPod(pod *v1.Pod) bool {
	return IsNotEvicted(pod.Namespace, pod.Annotations)
}

// IsNotEvicted returns true if parameters bear the not evicted pod annotation key.
func IsNotEvicted(ns string, annotations map[string]string) bool {
	// NotEvicted pods are restricted to "kube-system" namespace as of now.
	if ns != kubeapi.NamespaceSystem {
		return false
	}
	val, ok := annotations[NotEvictedPodAnnotationKey]
	if ok && val == "" {
		return true
	}
	return false
}
...
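The rules encoded in IsNotEvicted are easy to miss: the Pod must live in kube-system, and the annotation must be present with an empty value. The standalone sketch below re-implements the same logic with plain strings and maps so it can run without the Kubernetes packages; the lower-case names are local stand-ins for kubeapi.NamespaceSystem and NotEvictedPodAnnotationKey.

```go
package main

import "fmt"

const (
	// Local stand-ins for the constants used in the patch.
	namespaceSystem            = "kube-system"
	notEvictedPodAnnotationKey = "scheduler.alpha.kubernetes.io/not-evicted-pod"
)

// isNotEvicted mirrors IsNotEvicted: only kube-system Pods qualify, and the
// annotation must exist with an empty string as its value.
func isNotEvicted(ns string, annotations map[string]string) bool {
	if ns != namespaceSystem {
		return false
	}
	val, ok := annotations[notEvictedPodAnnotationKey]
	return ok && val == ""
}

func main() {
	marked := map[string]string{notEvictedPodAnnotationKey: ""}
	wrongValue := map[string]string{notEvictedPodAnnotationKey: "true"}

	fmt.Println(isNotEvicted("kube-system", marked))     // annotated, right namespace
	fmt.Println(isNotEvicted("default", marked))         // annotated, wrong namespace
	fmt.Println(isNotEvicted("kube-system", wrongValue)) // non-empty value
	fmt.Println(isNotEvicted("kube-system", nil))        // no annotations at all
}
```

Only the first call reports true. In particular, setting the annotation to "true" does not protect the Pod, which matches how the upstream critical-pod annotation is checked.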

3. eviction_manager.go

func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool {
	// If the pod is marked as critical and static, and support for critical pod annotations is enabled,
	// do not evict such pods. Static pods are not re-admitted after evictions.
	// https://github.com/kubernetes/kubernetes/issues/40573 has more details.
	if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
		kubelettypes.IsCriticalPod(pod) && kubepod.IsStaticPod(pod) {
		glog.Errorf("eviction manager: cannot evict a critical static pod %s", format.Pod(pod))
		return false
	}
	...
	// If the pod is marked as not evicted, and support for not evicted pod annotations is enabled,
	// do not evict such pods.
	if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalNotEvictedPodAnnotation) &&
		kubelettypes.IsNotEvictedPod(pod) {
		glog.Errorf("eviction manager: cannot evict a marked not evicted pod %s", format.Pod(pod))
		return false
	}
	...
	status := v1.PodStatus{
		Phase: v1.PodFailed,
		Message: evictMsg,
		Reason: Reason,
	}
	// record that we are evicting the pod
	m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)
	// this is a blocking call and should only return when the pod and its containers are killed.
	err := m.killPodFunc(pod, status, &gracePeriodOverride)
	if err != nil {
		glog.Errorf("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
	} else {
		glog.Infof("eviction manager: pod %s is evicted successfully", format.Pod(pod))
	}
	return true
}
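The two guards at the top of evictPod can be summarized as a small decision function. This is a simplified model rather than Kubelet code: the pod and gates structs are hypothetical, and each boolean stands for the result of the corresponding real check (IsCriticalPod, IsStaticPod, IsNotEvictedPod, and the two feature gates).

```go
package main

import "fmt"

// pod is a hypothetical, flattened view of the properties evictPod inspects.
type pod struct {
	name       string
	critical   bool // stands for kubelettypes.IsCriticalPod
	static     bool // stands for kubepod.IsStaticPod
	notEvicted bool // stands for kubelettypes.IsNotEvictedPod
}

// gates holds the state of the two feature gates involved.
type gates struct {
	criticalPodAnnotation   bool
	notEvictedPodAnnotation bool
}

// canEvict models the early returns in evictPod: true means the eviction
// manager would go on to kill the Pod.
func canEvict(p pod, g gates) bool {
	if g.criticalPodAnnotation && p.critical && p.static {
		return false
	}
	if g.notEvictedPodAnnotation && p.notEvicted {
		return false
	}
	return true
}

func main() {
	apiserver := pod{name: "kube-apiserver", critical: true, static: true}
	addon := pod{name: "network-addon", critical: true, notEvicted: true}

	stock := gates{criticalPodAnnotation: true}
	patched := gates{criticalPodAnnotation: true, notEvictedPodAnnotation: true}

	fmt.Println(canEvict(apiserver, stock)) // static control plane Pod
	fmt.Println(canEvict(addon, stock))     // Addon Pod on a stock Kubelet
	fmt.Println(canEvict(addon, patched))   // Addon Pod on the patched Kubelet
}
```

The Addon Pod is not static, so the stock guard never applies to it even when it is marked critical; only the second guard, with its gate enabled, keeps it on the node.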

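4. Recompile the source to produce a new Kubelet binary, install it on each Node, and start it as follows:

Note the --feature-gates part. ExperimentalCriticalPodAnnotation is a feature that upstream supports natively; it guarantees that Critical static Pods in kube-system are never evicted by the Kubelet. ExperimentalNotEvictedPodAnnotation is the feature added by the author; it guarantees that a Pod in kube-system whose Annotations include scheduler.alpha.kubernetes.io/not-evicted-pod (with an empty value, as IsNotEvicted requires) is likewise never evicted by the Kubelet. The feature exists to protect the system-level Pods and important Addon Pods of a container platform built on Kubernetes. Such Pods have often been added to extend Kubernetes itself, so evicting them arbitrarily is clearly wrong. With the scheduler.alpha.kubernetes.io/not-evicted-pod annotation, users can shield them, which makes the platform more robust and keeps the cluster from falling into an unrecoverable loop of killing and repairing itself when resources run short.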

KUBELET_EXTRA_ARGS="--cgroup-driver=cgroupfs --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause-amd64:3.1 --feature-gates=ExperimentalCriticalPodAnnotation=true,ExperimentalNotEvictedPodAnnotation=true"
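Enabling the gate only makes the Kubelet honor the annotation; each workload still has to carry it. One way to add it to an existing Addon is a patch on the Pod template of its controller. The sketch below builds such a patch body with the standard library only. The helper name is hypothetical, and the assumption is that the Addon runs under a controller with a spec.template, such as a DaemonSet or Deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// notEvictedPodAnnotationKey is the annotation introduced by the patch.
const notEvictedPodAnnotationKey = "scheduler.alpha.kubernetes.io/not-evicted-pod"

// podTemplateAnnotationPatch returns a JSON merge-style patch that sets one
// annotation on spec.template.metadata. Hypothetical helper, for illustration.
func podTemplateAnnotationPatch(key, value string) (string, error) {
	patch := map[string]interface{}{
		"spec": map[string]interface{}{
			"template": map[string]interface{}{
				"metadata": map[string]interface{}{
					"annotations": map[string]string{key: value},
				},
			},
		},
	}
	b, err := json.Marshal(patch)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// The value must be the empty string, because IsNotEvicted checks val == "".
	patch, err := podTemplateAnnotationPatch(notEvictedPodAnnotationKey, "")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(patch)
}
```

The printed JSON could then be passed to kubectl patch with -p against the controller in kube-system; the resource name depends on the Addon and is left to the reader.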

VI. References

https://github.com/kubernetes/kubernetes/tree/v1.11.0
https://github.com/kubernetes/kubernetes/issues/53659
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
https://www.centos.bz/2017/09/linux-命令行-创建指定大小的文件/
https://www.hi-linux.com/posts/59095.html