Quickly Setting Up an Open-Falcon Trial Environment with Docker Containers

I. Environment Information

1. Version information

Open Falcon v0.3

2. Server information

192.168.112.129

II. Experiment Notes

1. Pull the required Docker images:

docker pull mysql:5.7
docker pull redis:4-alpine3.8
docker pull openfalcon/falcon-plus:v0.3
docker pull openfalcon/falcon-dashboard:v0.2.1

2. Prepare the configuration files:

mkdir -p /open-falcon
git clone https://github.com/singhwang/falcon-config.git /open-falcon/
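
Before starting the containers, it can be worth a quick sanity check that the cloned repository actually contains the config files that will be bind-mounted in the next step (the expected paths below are simply taken from the volume mounts used later, so this is just a sketch):

ls /open-falcon/ctrl.sh
ls /open-falcon/*/config/cfg.json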

3. Start the Docker containers:

# Clear any data left over from previous runs
rm -rf /tmp/falcon-plus/
rm -rf /home/work/

# Start the MySQL container
docker run -itd \
--name falcon-mysql \
-v /home/work/mysql-data:/var/lib/mysql \
-e MYSQL_ROOT_PASSWORD=test123456 \
-p 3306:3306 \
mysql:5.7

# Initialize the MySQL databases
cd /tmp && \
git clone https://github.com/open-falcon/falcon-plus.git && \
cd /tmp/falcon-plus/ && \
git checkout -b v0.3.tag v0.3 && \
for x in `ls ./scripts/mysql/db_schema/*.sql`; do
echo init mysql table $x ...;
docker exec -i falcon-mysql mysql -uroot -ptest123456 < $x;
done

# Start the Redis container
docker run --name falcon-redis -p 6379:6379 -d redis:4-alpine3.8

# Start the falcon-plus container
# Note: the falcon-plus container is effectively a sandbox for Falcon; the full set of Falcon services can be started inside it
docker run -itd --name falcon-plus \
--link=falcon-mysql:db.falcon \
--link=falcon-redis:redis.falcon \
-p 8433:8433 \
-p 8080:8080 \
-p 6030:6030 \
-e MYSQL_PORT=root:test123456@tcp\(db.falcon:3306\) \
-e REDIS_PORT=redis.falcon:6379 \
-v /open-falcon/ctrl.sh:/open-falcon/ctrl.sh \
-v /open-falcon/graph/config/cfg.json:/open-falcon/graph/config/cfg.json \
-v /open-falcon/hbs/config/cfg.json:/open-falcon/hbs/config/cfg.json \
-v /open-falcon/judge/config/cfg.json:/open-falcon/judge/config/cfg.json \
-v /open-falcon/transfer/config/cfg.json:/open-falcon/transfer/config/cfg.json \
-v /open-falcon/nodata/config/cfg.json:/open-falcon/nodata/config/cfg.json \
-v /open-falcon/aggregator/config/cfg.json:/open-falcon/aggregator/config/cfg.json \
-v /open-falcon/agent/config/cfg.json:/open-falcon/agent/config/cfg.json \
-v /open-falcon/gateway/config/cfg.json:/open-falcon/gateway/config/cfg.json \
-v /open-falcon/api/config/cfg.json:/open-falcon/api/config/cfg.json \
-v /open-falcon/alarm/config/cfg.json:/open-falcon/alarm/config/cfg.json \
-v /home/work/open-falcon/data:/open-falcon/data \
-v /home/work/open-falcon/logs:/open-falcon/logs \
openfalcon/falcon-plus:v0.3

# Start all Falcon services
docker exec falcon-plus sh ctrl.sh start \
graph hbs judge transfer nodata aggregator agent gateway api alarm

docker exec falcon-plus ./open-falcon check

# Start the Falcon dashboard service
docker run -itd --name falcon-dashboard \
-p 8081:8081 \
--link=falcon-mysql:db.falcon \
--link=falcon-plus:api.falcon \
-e API_ADDR=http://api.falcon:8080/api/v1 \
-e PORTAL_DB_HOST=db.falcon \
-e PORTAL_DB_PORT=3306 \
-e PORTAL_DB_USER=root \
-e PORTAL_DB_PASS=test123456 \
-e PORTAL_DB_NAME=falcon_portal \
-e ALARM_DB_HOST=db.falcon \
-e ALARM_DB_PORT=3306 \
-e ALARM_DB_USER=root \
-e ALARM_DB_PASS=test123456 \
-e ALARM_DB_NAME=alarms \
-w /open-falcon/dashboard openfalcon/falcon-dashboard:v0.2.1 \
'./control startfg'

# How to stop the agent service
docker exec falcon-plus sh ctrl.sh stop agent

III. References

https://github.com/open-falcon/falcon-plus
http://open-falcon.org/
https://book.open-falcon.org/zh_0_2/
http://open-falcon.org/falcon-plus/
https://www.bookstack.cn/read/open-falcon-v0.2/SUMMARY.md
https://github.com/open-falcon/falcon-plus/issues/606

Verifying Cluster Networking in a Kubernetes Cluster

To verify that networking in a Kubernetes cluster works properly, the checks usually cover the following aspects:

  1. Network connectivity between Pods on the same host;
  2. Network connectivity between Pods on different hosts;
  3. Whether Pods on different hosts can resolve DNS records inside the Kubernetes cluster.

To run these checks smoothly, a dedicated verification environment is needed. The following first describes how to build such an environment, and then how to verify the network.
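
If you only need a quick one-off check rather than a long-lived verification environment, a throwaway Pod is often enough. A minimal sketch, assuming the busybox:1.28 image is reachable from the cluster (newer busybox builds have a known nslookup quirk) and with a placeholder Pod IP:

# DNS check from a temporary Pod
kubectl run net-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default
# connectivity check against another Pod's IP (replace <some-pod-ip>)
kubectl run ping-test --image=busybox:1.28 --rm -it --restart=Never -- ping -c 3 <some-pod-ip>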

I. Building the Cluster Network Verification Environment

1. Build the base image for the verification environment

Alpine 3.8 Dockerfile Example

FROM alpine:3.8

MAINTAINER wangxin_0611@126.com

RUN apk add --no-cache ca-certificates bind-tools iputils iproute2 net-tools tcpdump

Ubuntu 16.04 Dockerfile Example

FROM ubuntu:16.04

MAINTAINER wangxin_0611@126.com

RUN apt-get update && \
apt-get install -y iproute2 && \
apt-get install -y dnsutils && \
apt-get install -y net-tools && \
apt-get install -y iputils-ping && \
apt-get install -y tcpdump

CentOS 7.5.1804 Dockerfile Example

FROM centos:7.5.1804

MAINTAINER wangxin_0611@126.com

RUN yum makecache fast && \
yum install -y iproute && \
yum install -y bind-utils && \
yum install -y net-tools && \
yum install -y iputils && \
yum install -y tcpdump

2. Use the base image to deploy one Pod on every host; a DaemonSet is the most suitable way to do this.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: network
  namespace: default
spec:
  selector:
    matchLabels:
      app: network
  template:
    metadata:
      labels:
        app: network
    spec:
      containers:
      - name: network
        image: 10.0.55.126/base/alpine:3.8-network
        imagePullPolicy: IfNotPresent
        command:
        - sleep
        - "3600"
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        operator: Exists

After creating it, the DaemonSet and its Pods look like this:

[root@master alpine]# kubectl get daemonsets
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
network 9 9 9 9 9 <none> 1m
[root@master alpine]# kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
network-76xzk 1/1 Running 0 1m 10.211.3.42 node03
network-79dzf 1/1 Running 0 1m 10.211.2.195 node02
network-ftn7g 1/1 Running 0 1m 10.211.4.25 node04
network-jbr8g 1/1 Running 0 1m 10.211.13.187 node07
network-kflgv 1/1 Running 0 1m 10.211.0.153 master
network-mvqlx 1/1 Running 0 1m 10.211.14.97 node08
network-nbzsc 1/1 Running 0 1m 10.211.12.94 node06
network-rxc2f 1/1 Running 0 1m 10.211.6.5 node05
network-w89xg 1/1 Running 0 1m 10.211.1.240 node01

II. Verifying the Cluster Network

1. Enter the Pod on the master node and ping the IP address of the Pod on each other Node to verify connectivity;

[root@master alpine]# kubectl exec -it network-kflgv /bin/sh
/ # ping -c 4 10.211.1.240
PING 10.211.1.240 (10.211.1.240) 56(84) bytes of data.
64 bytes from 10.211.1.240: icmp_seq=1 ttl=62 time=0.575 ms
64 bytes from 10.211.1.240: icmp_seq=2 ttl=62 time=0.374 ms
64 bytes from 10.211.1.240: icmp_seq=3 ttl=62 time=0.445 ms
64 bytes from 10.211.1.240: icmp_seq=4 ttl=62 time=0.380 ms

--- 10.211.1.240 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3103ms
rtt min/avg/max/mdev = 0.374/0.443/0.575/0.083 ms
/ # ping -c 4 10.211.2.195
PING 10.211.2.195 (10.211.2.195) 56(84) bytes of data.
64 bytes from 10.211.2.195: icmp_seq=1 ttl=62 time=0.390 ms
64 bytes from 10.211.2.195: icmp_seq=2 ttl=62 time=0.544 ms
64 bytes from 10.211.2.195: icmp_seq=3 ttl=62 time=0.460 ms
64 bytes from 10.211.2.195: icmp_seq=4 ttl=62 time=0.483 ms

--- 10.211.2.195 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3064ms
rtt min/avg/max/mdev = 0.390/0.469/0.544/0.057 ms
/ # ping -c 4 10.211.3.42
PING 10.211.3.42 (10.211.3.42) 56(84) bytes of data.
64 bytes from 10.211.3.42: icmp_seq=1 ttl=62 time=0.598 ms
64 bytes from 10.211.3.42: icmp_seq=2 ttl=62 time=0.463 ms
64 bytes from 10.211.3.42: icmp_seq=3 ttl=62 time=0.530 ms
64 bytes from 10.211.3.42: icmp_seq=4 ttl=62 time=0.426 ms

... and so on for the remaining Pods.

2. Verify that each Pod can resolve DNS records inside the Kubernetes cluster.

[root@master alpine]# kubectl exec -it network-kflgv /bin/sh
/ # nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1

[root@master alpine]# kubectl exec -it network-w89xg /bin/sh
/ # nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1

[root@master alpine]# kubectl exec -it network-79dzf /bin/sh
/ # nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1

[root@master alpine]# kubectl exec -it network-76xzk /bin/sh
/ # nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53

Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1

... and so on for the remaining Pods.
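
Instead of exec'ing into each Pod by hand, the DNS check can be looped over every Pod of the DaemonSet. A small sketch, relying only on the app=network label defined above:

for p in $(kubectl get pods -l app=network -o jsonpath='{.items[*].metadata.name}'); do
  echo "==== $p"
  kubectl exec "$p" -- nslookup kubernetes.default
done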

A Roundup of Classic Prometheus Resources

Official documentation

https://prometheus.io/docs/prometheus/2.2/getting_started/

Prometheus Operator open-source project

https://github.com/coreos/prometheus-operator/tree/v0.25.0

A classic Chinese-language reference

https://yunlzheng.gitbook.io/prometheus-book/introduction

Implementing a Prometheus exporter in Golang

https://blog.csdn.net/u014029783/article/details/80001251

Demystifying the timestamp format of Prometheus API queries

https://www.crifan.com/timestamp_format_support_decimal_point_or_not/

Rounding in Golang (often needed when formatting monitoring metrics)

https://www.jianshu.com/p/ca52f4f58353

Using Dragonfly for P2P Image Distribution

I. Dragonfly Overview

1. Introduction

Dragonfly is an intelligent, P2P-based image and file distribution tool. It aims to improve the efficiency and speed of file transfer and make the most of network bandwidth, especially when distributing large amounts of data, e.g. application distribution, cache distribution, log distribution and image distribution.
Although container technology simplifies operations most of the time, it also brings challenges, such as the efficiency of image distribution, especially when images have to be replicated to many hosts. Dragonfly works seamlessly with Docker in this scenario: compared with the native approach it can speed up container image distribution by up to 57x and reduce the registry's outbound network traffic by 99.5%, making image distribution simple and economical.

2. Official documentation

https://d7y.io/en-us/

3. GitHub repository

https://github.com/dragonflyoss/Dragonfly/tree/v0.3.0

4. Caveats

The latest Dragonfly release at the moment is v0.3.0, which does not yet support authentication for private images in a registry. For example, with a registry such as Harbor, if a project is set to private, the images under that project cannot be distributed through Dragonfly directly. The simplest workaround is to set the project that contains the image to public. Another approach is to configure a global HTTP proxy for the Docker daemon, but that only works with version 0.0.1; the latest v0.3.0 does not support it. Unfortunately, in my tests this approach did not work well either. The relevant upstream issue is:
https://github.com/dragonflyoss/Dragonfly/issues/138

To configure a global proxy for the Docker daemon:

mkdir -p /etc/systemd/system/docker.service.d
vi /etc/systemd/system/docker.service.d/http-proxy.conf

[Service]
Environment="HTTP_PROXY=http://127.0.0.1:65001"

systemctl daemon-reload
systemctl restart docker.service
systemctl show --property=Environment docker.service

Reference for configuring a global proxy for the Docker daemon:
https://docs.docker.com/config/daemon/systemd/#httphttps-proxy

II. Environment Information

1. Version information

Docker Engine Community 18.09.5
Harbor 0.5.0
Dragonfly v0.3.0

2. Server information

172.16.170.134 <-> supernode
172.16.170.135 <-> dfclient
172.16.170.136 <-> dfclient

III. Experiment Notes

1. Install the supernode on 172.16.170.134:

docker run -d --name supernode --restart=always -p 8001:8001 -p 8002:8002 \
dragonflyoss/supernode:0.3.0 -Dsupernode.advertiseIp=172.16.170.134

2. Install dfclient on both 172.16.170.135 and 172.16.170.136:

cat <<EOD > /etc/dragonfly/dfget.yml
nodes:
- 172.16.170.134
EOD

cat <<EOD > /etc/docker/daemon.json
{
"registry-mirrors": ["http://127.0.0.1:65001"],
"insecure-registries": ["10.0.55.126"]
}
EOD

systemctl daemon-reload
systemctl restart docker.service

docker run -d --name dfclient --restart=always -p 65001:65001 \
-v /etc/dragonfly:/etc/dragonfly \
dragonflyoss/dfclient:v0.3.0 --registry http://10.0.55.126

3. Pull an image on both 172.16.170.135 and 172.16.170.136:

# docker pull base/alpine:3.8
3.8: Pulling from base/alpine
16f532fbdc2a: Already exists
Digest: sha256:78903b603e9fe76129d1a59ec94bc1ad47769b98e57e8f0c0a57760b12615960
Status: Downloaded newer image for base/alpine:3.8

# docker pull base/alpine:3.8
3.8: Pulling from base/alpine
Digest: sha256:78903b603e9fe76129d1a59ec94bc1ad47769b98e57e8f0c0a57760b12615960
Status: Image is up to date for base/alpine:3.8
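
To confirm the pull really went through Dragonfly rather than straight to the registry, the dfclient container logs can be inspected; a rough sketch (log layout differs between Dragonfly versions, so treat this as a hint rather than the official procedure):

docker logs --tail 50 dfclient
# to time a fresh pull, remove the local copy first
docker rmi base/alpine:3.8
time docker pull base/alpine:3.8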

IV. References

https://github.com/dragonflyoss/Dragonfly/tree/v0.0.1
https://github.com/dragonflyoss/Dragonfly/issues/138
http://dockone.io/article/4646
https://mp.weixin.qq.com/s/95mX8cDox5bmgQ2xGHLPqQ
https://www.cnblogs.com/atuotuo/p/7298673.html
https://docs.docker.com/config/daemon/systemd/
http://likakuli.com/post/2018/09/13/dragonfly/

Notes on an Nginx Ingress Experiment

Clone the source code and check out the latest stable release

git clone https://github.com/kubernetes/ingress-nginx.git
cd ingress-nginx/
git checkout -b nginx-0.21.0 nginx-0.21.0

Create the backend service that the Ingress will proxy

mkdir -p deploy/example
vi deploy/example/nginx.yaml

Add the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15.4
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

Nginx Ingress Controller as a layer-4 load balancer: verifying L4 load balancing

In the data section of the tcp-services ConfigMap, add a mapping from the port to be exposed to the target service and port, as in the example below:

kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-configuration
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx

---
kind: ConfigMap
apiVersion: v1
metadata:
  name: tcp-services
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
data:
  8080: "default/nginx:80" # map exposed port 8080 to port 80 of the default/nginx service
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: udp-services
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx

Edit deploy/with-rbac.yaml and switch the Pod template to hostNetwork mode.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nginx-ingress-controller
namespace: ingress-nginx
labels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
template:
metadata:
labels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
annotations:
prometheus.io/port: "10254"
prometheus.io/scrape: "true"
spec:
hostNetwork: true # note: hostNetwork is enabled here
serviceAccountName: nginx-ingress-serviceaccount
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.21.0
args:
- /nginx-ingress-controller
- --configmap=$(POD_NAMESPACE)/nginx-configuration
- --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
- --udp-services-configmap=$(POD_NAMESPACE)/udp-services
- --publish-service=$(POD_NAMESPACE)/ingress-nginx
- --annotations-prefix=nginx.ingress.kubernetes.io
# - --http-port=8080 # HTTP listen port, defaults to 80
# - --https-port=8443 # HTTPS listen port, defaults to 443
securityContext:
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
# www-data -> 33
runAsUser: 33
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- name: http
containerPort: 80
- name: https
containerPort: 443
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1

---

Run the following against Kubernetes:

kubectl create -f deploy/namespace.yaml
kubectl create -f deploy/configmap.yaml
kubectl create -f deploy/rbac.yaml
kubectl create -f deploy/with-rbac.yaml
kubectl create -f deploy/provider/baremetal/service-nodeport.yaml

Check which Node the Nginx Ingress Controller runs on

# kubectl get pod -n ingress-nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE
nginx-ingress-controller-7799468ccb-5bzgv 1/1 Running 0 2m 172.16.170.129 server02

Access it in a browser; the layer-4 load-balancing result is shown in the figure below:

(figure: ingress_4)
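
The same layer-4 check can also be done from the command line: tcp-services maps port 8080 to default/nginx:80 and the controller runs with hostNetwork on server02, so curling port 8080 on that node (IP taken from the output above) should return the nginx welcome page:

curl -I http://172.16.170.129:8080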

Nginx Ingress Controller as a layer-7 load balancer: verifying L7 load balancing

Modify deploy/with-rbac.yaml again, this time exposing the HTTP and HTTPS ports via hostPort instead of hostNetwork, because hostNetwork puts the Pod on the Node's network stack, including the Node's DNS servers and resolution. For a layer-7 load balancer it is enough to expose the default HTTP and HTTPS ports; different paths can then be mapped to different services.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: nginx-ingress-controller
namespace: ingress-nginx
labels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
spec:
replicas: 1
selector:
matchLabels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
template:
metadata:
labels:
app.kubernetes.io/name: ingress-nginx
app.kubernetes.io/part-of: ingress-nginx
annotations:
prometheus.io/port: "10254"
prometheus.io/scrape: "true"
spec:
# hostNetwork: true
serviceAccountName: nginx-ingress-serviceaccount
containers:
- name: nginx-ingress-controller
image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.21.0
args:
- /nginx-ingress-controller
- --configmap=$(POD_NAMESPACE)/nginx-configuration
- --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
- --udp-services-configmap=$(POD_NAMESPACE)/udp-services
- --publish-service=$(POD_NAMESPACE)/ingress-nginx
- --annotations-prefix=nginx.ingress.kubernetes.io
# - --http-port=8080 # HTTP listen port, defaults to 80
# - --https-port=8443 # HTTPS listen port, defaults to 443
securityContext:
capabilities:
drop:
- ALL
add:
- NET_BIND_SERVICE
# www-data -> 33
runAsUser: 33
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
ports:
- name: http
containerPort: 80
hostPort: 80 # expose the HTTP port via hostPort
- name: https
containerPort: 443
hostPort: 443 # expose the HTTPS port via hostPort
livenessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
readinessProbe:
failureThreshold: 3
httpGet:
path: /healthz
port: 10254
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1

---

Create deploy/example/ingress.yaml with the following content:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: nginx-ingress-app
  namespace: default
spec:
  rules:
  - host: foo.bar.com
    http:
      paths:
      - path: /
        backend:
          serviceName: nginx
          servicePort: 80

Create the Ingress resource in the Kubernetes cluster:

kubectl create -f deploy/example/ingress.yaml

Access it in a browser; the layer-7 load-balancing result is shown in the figure below:

(figure: ingress_7)
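
Without touching DNS or /etc/hosts, the layer-7 rule can also be verified with curl by faking the Host header, assuming the controller Pod is still scheduled on 172.16.170.129:

curl -H 'Host: foo.bar.com' http://172.16.170.129/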

References

https://github.com/kubernetes/ingress-nginx/tree/nginx-0.21.0
https://kubernetes.github.io/ingress-nginx/
https://kubernetes.io/docs/concepts/services-networking/ingress/
https://www.cnblogs.com/iiiiher/p/8006801.html

Manually Cleaning Up Logs Produced by Docker Containers

In practice I have noticed that some colleagues run Docker containers directly on hosts, mostly with the json-file log driver and without configuring any log rotation or cleanup; the exact reasons are unclear. The applications produce quite a bit of log output, so after a while something interesting happens: Docker mysteriously stops working, and it turns out the logs have filled up the disk. After some searching I found a handy way to clean the logs up manually and quickly, recorded here for future reference. The original source is listed under "References".
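
The cleaner long-term fix is to enable log rotation in the Docker daemon itself so the problem never comes back. A minimal sketch (the size and file-count values are just examples; the settings only apply to containers created after the restart, and any existing daemon.json options need to be merged in rather than overwritten):

cat <<EOD > /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
EOD
systemctl restart docker.service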

1. With Docker's default configuration, use the following command to check the size of each container's log file;

ls -lh $(find /var/lib/docker/containers/ -name "*-json.log")

2. The following script can be used to manually clean up the container logs on a host.

#!/bin/sh
echo "==================== start clean docker containers logs =========================="

logs=$(find /var/lib/docker/containers/ -name "*-json.log")

for log in $logs
do
echo "clean logs : $log"
cat /dev/null > $log
done


echo "==================== end clean docker containers logs =========================="

References:
https://blog.csdn.net/xunzhaoyao/article/details/72959917

How to Protect System-Level Pods from Being Evicted by the Kubelet?

I. Environment Versions

Docker 17.03.1-ce
Kubeadm v1.11.0
Kubelet v1.11.0
Kubectl v1.11.0
Calico v3.1.3

II. Problem Description

On a Kubernetes cluster built with kubeadm, when resources become tight the Kubernetes control-plane components start dying and the cluster becomes unusable. For example, push disk usage above 90% on the server hosting the control plane and the problem reproduces immediately.
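
To reproduce it deliberately on a test machine, temporarily filling the disk is enough. A rough sketch (the file size and node name are placeholders; remember to remove the file afterwards):

df -h /
fallocate -l 20G /tmp/fill-disk.img        # push disk usage above the eviction threshold
kubectl describe node <master-node> | grep -i pressure
kubectl get events --all-namespaces | grep -i evict
rm -f /tmp/fill-disk.img                   # clean up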

III. Root Cause

Investigation showed that when resources are tight, the kubelet evicts Pods on the affected Node, and this naturally includes the control-plane Pods: they get killed and their images may be removed. The kubelet does this to reclaim resources, but killing the control-plane Pods is obviously not what users want. Zoom out to a worker Node and the problem becomes: critical system Pods on that Node get killed, turning it into a broken Node. Zoom out to the whole cluster and the control-plane Pods, the critical system-level Pods on each Node, and the important Addon-level Pods can all be killed. At scale, the severity is easy to imagine.

IV. Analysis

Some Kubernetes users may say: clusters built with kubeadm are not recommended for production, so of course you will hit problems if you insist on using one. That is not surprising. For a long time kubeadm-built clusters were indeed not recommended for production, first by the project itself and then, following the official line, by most users. As Kubernetes evolved, however, that "not recommended for production" warning quietly disappeared; the official documentation now spends a great deal of ink on kubeadm, including a description of a high-availability setup, while other installation methods are only mentioned in passing, if at all. What does that tell us? That upstream now recommends and encourages deploying Kubernetes with kubeadm.

Can a cluster built this way really be used in production? If not, which problems prevent it? I have done a lot of hands-on work here and am already running such clusters in production. In practice kubeadm is quite mature; there are not many problems, although there are some, and thanks to its clean code structure they are easy to fix. The version I run in production carries a few local modifications.

Here I list the two biggest issues I hit with kubeadm in practice, and describe the solution to the fatal one in detail in this article.

  1. Time zones. The time zone issue looks trivial and rarely gets attention, and the default UTC is good enough for users in the US. But what about users in other time zones, especially those of us in China? The fix is to modify the kubeadm source so that the generated YAML for the control-plane Pods includes the relevant time-zone settings; see "Time Zone Settings for Docker Containers".
  2. When a Node runs low on resources, the kubelet starts evicting Pods and deleting their images to reclaim resources. Sounds reasonable, but in reality it is a fatal problem. Why? With kubeadm, the static Pods (kube-apiserver, kube-controller-manager, kube-scheduler) are fine, since there is an official mechanism that prevents them from being evicted; but system-level Pods installed as Addons have no such protection. Reproducing the problem is easy: push disk usage above 90% and the cluster immediately starts oscillating between killing itself and recovering. Can it be fixed? Yes, but the community has not fixed it yet; see the related GitHub issue in the references. To use this in production I had to fix it myself, and since the behaviour comes from the kubelet, the kubelet source has to be modified.

V. Solution (using Kubernetes v1.11.0 as the example; higher versions differ somewhat)

The example below shows how to read the listings that follow (how the code I add or modify is marked):
example.go

...
code to be added or modified goes here
...

Now the actual source changes:

1. kube_features.go

// defaultKubernetesFeatureGates consists of all known Kubernetes-specific feature keys.
// To add a new feature, define a key for it above and add it here. The features will be
// available throughout Kubernetes binaries.
var defaultKubernetesFeatureGates = map[utilfeature.Feature]utilfeature.FeatureSpec{
AppArmor: {Default: true, PreRelease: utilfeature.Beta},
DynamicKubeletConfig: {Default: true, PreRelease: utilfeature.Beta},
ExperimentalHostUserNamespaceDefaultingGate: {Default: false, PreRelease: utilfeature.Beta},
ExperimentalCriticalPodAnnotation: {Default: false, PreRelease: utilfeature.Alpha},
...
ExperimentalNotEvictedPodAnnotation: {Default: false, PreRelease: utilfeature.Alpha},
...
DevicePlugins: {Default: true, PreRelease: utilfeature.Beta},
TaintBasedEvictions: {Default: false, PreRelease: utilfeature.Alpha},
RotateKubeletServerCertificate: {Default: false, PreRelease: utilfeature.Alpha},
RotateKubeletClientCertificate: {Default: true, PreRelease: utilfeature.Beta},
PersistentLocalVolumes: {Default: true, PreRelease: utilfeature.Beta},
LocalStorageCapacityIsolation: {Default: true, PreRelease: utilfeature.Beta},
HugePages: {Default: true, PreRelease: utilfeature.Beta},
Sysctls: {Default: true, PreRelease: utilfeature.Beta},
DebugContainers: {Default: false, PreRelease: utilfeature.Alpha},
PodShareProcessNamespace: {Default: false, PreRelease: utilfeature.Alpha},
PodPriority: {Default: true, PreRelease: utilfeature.Beta},
EnableEquivalenceClassCache: {Default: false, PreRelease: utilfeature.Alpha},
TaintNodesByCondition: {Default: false, PreRelease: utilfeature.Alpha},
MountPropagation: {Default: true, PreRelease: utilfeature.Beta},
QOSReserved: {Default: false, PreRelease: utilfeature.Alpha},
ExpandPersistentVolumes: {Default: true, PreRelease: utilfeature.Beta},
ExpandInUsePersistentVolumes: {Default: false, PreRelease: utilfeature.Alpha},
AttachVolumeLimit: {Default: false, PreRelease: utilfeature.Alpha},
CPUManager: {Default: true, PreRelease: utilfeature.Beta},
ServiceNodeExclusion: {Default: false, PreRelease: utilfeature.Alpha},
MountContainers: {Default: false, PreRelease: utilfeature.Alpha},
VolumeScheduling: {Default: true, PreRelease: utilfeature.Beta},
CSIPersistentVolume: {Default: true, PreRelease: utilfeature.Beta},
CustomPodDNS: {Default: true, PreRelease: utilfeature.Beta},
BlockVolume: {Default: false, PreRelease: utilfeature.Alpha},
StorageObjectInUseProtection: {Default: true, PreRelease: utilfeature.GA},
ResourceLimitsPriorityFunction: {Default: false, PreRelease: utilfeature.Alpha},
SupportIPVSProxyMode: {Default: true, PreRelease: utilfeature.GA},
SupportPodPidsLimit: {Default: false, PreRelease: utilfeature.Alpha},
HyperVContainer: {Default: false, PreRelease: utilfeature.Alpha},
ScheduleDaemonSetPods: {Default: false, PreRelease: utilfeature.Alpha},
TokenRequest: {Default: false, PreRelease: utilfeature.Alpha},
TokenRequestProjection: {Default: false, PreRelease: utilfeature.Alpha},
CRIContainerLogRotation: {Default: true, PreRelease: utilfeature.Beta},
GCERegionalPersistentDisk: {Default: true, PreRelease: utilfeature.Beta},
RunAsGroup: {Default: false, PreRelease: utilfeature.Alpha},
VolumeSubpath: {Default: true, PreRelease: utilfeature.GA},
BalanceAttachedNodeVolumes: {Default: false, PreRelease: utilfeature.Alpha},
DynamicProvisioningScheduling: {Default: false, PreRelease: utilfeature.Alpha},
PodReadinessGates: {Default: false, PreRelease: utilfeature.Beta},
VolumeSubpathEnvExpansion: {Default: false, PreRelease: utilfeature.Alpha},
KubeletPluginsWatcher: {Default: false, PreRelease: utilfeature.Alpha},
ResourceQuotaScopeSelectors: {Default: false, PreRelease: utilfeature.Alpha},
CSIBlockVolume: {Default: false, PreRelease: utilfeature.Alpha},

// inherited features from generic apiserver, relisted here to get a conflict if it is changed
// unintentionally on either side:
genericfeatures.StreamingProxyRedirects: {Default: true, PreRelease: utilfeature.Beta},
genericfeatures.AdvancedAuditing: {Default: true, PreRelease: utilfeature.Beta},
genericfeatures.APIResponseCompression: {Default: false, PreRelease: utilfeature.Alpha},
genericfeatures.Initializers: {Default: false, PreRelease: utilfeature.Alpha},
genericfeatures.APIListChunking: {Default: true, PreRelease: utilfeature.Beta},

// inherited features from apiextensions-apiserver, relisted here to get a conflict if it is changed
// unintentionally on either side:
apiextensionsfeatures.CustomResourceValidation: {Default: true, PreRelease: utilfeature.Beta},
apiextensionsfeatures.CustomResourceSubresources: {Default: true, PreRelease: utilfeature.Beta},

// features that enable backwards compatibility but are scheduled to be removed
ServiceProxyAllowExternalIPs: {Default: false, PreRelease: utilfeature.Deprecated},
ReadOnlyAPIDataVolumes: {Default: true, PreRelease: utilfeature.Deprecated},
}


// owner: @vishh
// alpha: v1.5
//
// Ensures guaranteed scheduling of pods marked with a special pod annotation `scheduler.alpha.kubernetes.io/critical-pod`
// and also prevents them from being evicted from a node.
// Note: This feature is not supported for `BestEffort` pods.
ExperimentalCriticalPodAnnotation utilfeature.Feature = "ExperimentalCriticalPodAnnotation"
...
// owner: @singhwang
// alpha: v1.5
//
// Ensures guaranteed scheduling of pods marked with a special pod annotation `scheduler.alpha.kubernetes.io/not-evicted-pod`
// and also prevents them from being evicted from a node.
ExperimentalNotEvictedPodAnnotation utilfeature.Feature = "ExperimentalNotEvictedPodAnnotation"
...
// owner: @jiayingz
// beta: v1.10
//
// Enables support for Device Plugins
DevicePlugins utilfeature.Feature = "DevicePlugins"

2. pod_update.go

const (
ConfigSourceAnnotationKey = "kubernetes.io/config.source"
ConfigMirrorAnnotationKey = v1.MirrorPodAnnotationKey
ConfigFirstSeenAnnotationKey = "kubernetes.io/config.seen"
ConfigHashAnnotationKey = "kubernetes.io/config.hash"
CriticalPodAnnotationKey = "scheduler.alpha.kubernetes.io/critical-pod"
...
NotEvictedPodAnnotationKey = "scheduler.alpha.kubernetes.io/not-evicted-pod"
...
)

// IsCriticalPodBasedOnPriority checks if the given pod is a critical pod based on priority resolved from pod Spec.
func IsCriticalPodBasedOnPriority(priority int32) bool {
if priority >= scheduling.SystemCriticalPriority {
return true
}
return false
}
...
// IsNotEvictedPod returns true if the pod bears the not evicted pod annotation key.
func IsNotEvictedPod(pod *v1.Pod) bool {
return IsNotEvicted(pod.Namespace, pod.Annotations)
}

// IsNotEvicted returns true if parameters bear the not evicted pod annotation key.
func IsNotEvicted(ns string, annotations map[string]string) bool {
// NotEvicted pods are restricted to "kube-system" namespace as of now.
if ns != kubeapi.NamespaceSystem {
return false
}
val, ok := annotations[NotEvictedPodAnnotationKey]
if ok && val == "" {
return true
}
return false
}
...

3. eviction_manager.go

func (m *managerImpl) evictPod(pod *v1.Pod, gracePeriodOverride int64, evictMsg string, annotations map[string]string) bool {
// If the pod is marked as critical and static, and support for critical pod annotations is enabled,
// do not evict such pods. Static pods are not re-admitted after evictions.
// https://github.com/kubernetes/kubernetes/issues/40573 has more details.
if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalCriticalPodAnnotation) &&
kubelettypes.IsCriticalPod(pod) && kubepod.IsStaticPod(pod) {
glog.Errorf("eviction manager: cannot evict a critical static pod %s", format.Pod(pod))
return false
}
...
// If the pod is marked as not evicted, and support for not evicted pod annotations is enabled,
// do not evict such pods.
if utilfeature.DefaultFeatureGate.Enabled(features.ExperimentalNotEvictedPodAnnotation) &&
kubelettypes.IsNotEvictedPod(pod) {
glog.Errorf("eviction manager: cannot evict a marked not evicted pod %s", format.Pod(pod))
return false
}
...
status := v1.PodStatus{
Phase: v1.PodFailed,
Message: evictMsg,
Reason: Reason,
}
// record that we are evicting the pod
m.recorder.AnnotatedEventf(pod, annotations, v1.EventTypeWarning, Reason, evictMsg)
// this is a blocking call and should only return when the pod and its containers are killed.
err := m.killPodFunc(pod, status, &gracePeriodOverride)
if err != nil {
glog.Errorf("eviction manager: pod %s failed to evict %v", format.Pod(pod), err)
} else {
glog.Infof("eviction manager: pod %s is evicted successfully", format.Pod(pod))
}
return true
}

4. Recompile the source, produce a new kubelet binary, replace it on the relevant Nodes, and use it as follows:

Note the --feature-gates part. ExperimentalCriticalPodAnnotation is an upstream feature that keeps Critical-level static Pods in kube-system from ever being evicted by the kubelet. ExperimentalNotEvictedPodAnnotation is the feature I added: any Pod in kube-system whose annotations include scheduler.alpha.kubernetes.io/not-evicted-pod will likewise never be evicted by the kubelet. The goal is to protect the system-level Pods and important Addon-level Pods of a Kubernetes-based container platform (these are often added to extend Kubernetes itself, and evicting them at will is clearly wrong): users can mark them with scheduler.alpha.kubernetes.io/not-evicted-pod to make the platform more robust and to keep the cluster from falling into a self-destruct/self-heal loop it cannot recover from when resources run low.

KUBELET_EXTRA_ARGS="--cgroup-driver=cgroupfs --pod-infra-container-image=registry.cn-hangzhou.aliyuncs.com/google_containers/pause-amd64:3.1 --feature-gates=ExperimentalCriticalPodAnnotation=true,ExperimentalNotEvictedPodAnnotation=true"
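
With the feature gate enabled, marking an Addon is then just a matter of adding the annotation to its Pod template (and the Pod must live in kube-system, as enforced in pod_update.go above). A hedged sketch with a placeholder DaemonSet name:

kubectl -n kube-system patch daemonset <addon-name> --type=merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/not-evicted-pod":""}}}}}'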

VI. References

https://github.com/kubernetes/kubernetes/tree/v1.11.0
https://github.com/kubernetes/kubernetes/issues/53659
https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
https://www.centos.bz/2017/09/linux-命令行-创建指定大小的文件/
https://www.hi-linux.com/posts/59095.html

Kubernetes Batch Deletion of Resource Objects: Why Deletion Order and Watching Deletion Status Matter

I. Problem Description

An internal platform in our company is built on Kubernetes (the storage layer uses dynamic provisioning backed by NFS). The platform has a logical unit that can be created through an API call. At the Kubernetes level this logical unit is a collection of closely related resource objects: namespaces, deployments, statefulsets, jobs, persistentvolumeclaims, persistentvolumes and so on. You can think of it as a fairly complex Kubernetes application that even spans namespaces.

When developing, debugging and operating this logical unit, colleagues usually do the following:

  1. For the unit's cleanup API, they call the Kubernetes API directly to delete all of the unit's resource objects, in no particular order and without watching the deletion status of each object;
  2. For manual cleanup during operations, to save effort they simply delete the unit's namespaces with kubectl.

Recently we started seeing Pods in production frequently stuck in the Terminating state, never getting deleted. I occasionally saw it myself, and colleagues kept reporting it too. I went to the host of one stuck Pod, checked the kubelet log, and found errors like the following:

...
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.324703 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/9c7bb592-e883-11e8-aae0-0050568a2af3-pvc-9c75962b-e883-11e8-aae0-0050568a2af3\" (\"9c7bb592-e883-11e8-aae0-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.324623633 +0800 CST m=+15413.721922139 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-9c75962b-e883-11e8-aae0-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/9c7bb592-e883-11e8-aae0-0050568a2af3-pvc-9c75962b-e883-11e8-aae0-0050568a2af3\") pod \"9c7bb592-e883-11e8-aae0-0050568a2af3\" (UID: \"9c7bb592-e883-11e8-aae0-0050568a2af3\") : error reading /var/lib/kubelet/pods/9c7bb592-e883-11e8-aae0-0050568a2af3/volume-subpaths/pvc-9c75962b-e883-11e8-aae0-0050568a2af3/orderer1: lstat /var/lib/kubelet/pods/9c7bb592-e883-11e8-aae0-0050568a2af3/volume-subpaths/pvc-9c75962b-e883-11e8-aae0-0050568a2af3/orderer1/0: stale NFS file handle"
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.324809 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/b46aa346-e71d-11e8-b49a-0050568a2af3-pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3\" (\"b46aa346-e71d-11e8-b49a-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.324727477 +0800 CST m=+15413.722025987 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/b46aa346-e71d-11e8-b49a-0050568a2af3-pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3\") pod \"b46aa346-e71d-11e8-b49a-0050568a2af3\" (UID: \"b46aa346-e71d-11e8-b49a-0050568a2af3\") : error reading /var/lib/kubelet/pods/b46aa346-e71d-11e8-b49a-0050568a2af3/volume-subpaths/pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3/orderer1: lstat /var/lib/kubelet/pods/b46aa346-e71d-11e8-b49a-0050568a2af3/volume-subpaths/pvc-5f5024f7-e6ed-11e8-989d-0050568a2af3/orderer1/0: stale NFS file handle"
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.325743 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/9b2a49c7-e726-11e8-a7a5-0050568a2af3-pvc-9b07268f-e726-11e8-a7a5-0050568a2af3\" (\"9b2a49c7-e726-11e8-a7a5-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.325669009 +0800 CST m=+15413.722967535 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-9b07268f-e726-11e8-a7a5-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/9b2a49c7-e726-11e8-a7a5-0050568a2af3-pvc-9b07268f-e726-11e8-a7a5-0050568a2af3\") pod \"9b2a49c7-e726-11e8-a7a5-0050568a2af3\" (UID: \"9b2a49c7-e726-11e8-a7a5-0050568a2af3\") : error reading /var/lib/kubelet/pods/9b2a49c7-e726-11e8-a7a5-0050568a2af3/volume-subpaths/pvc-9b07268f-e726-11e8-a7a5-0050568a2af3/cli: lstat /var/lib/kubelet/pods/9b2a49c7-e726-11e8-a7a5-0050568a2af3/volume-subpaths/pvc-9b07268f-e726-11e8-a7a5-0050568a2af3/cli/1: stale NFS file handle"
11月 16 14:11:34 node06 kubelet[5505]: E1116 14:11:34.325843 5505 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/nfs/dc5262e0-e710-11e8-989d-0050568a2af3-pvc-66a1068f-e22d-11e8-a331-0050568a2af3\" (\"dc5262e0-e710-11e8-989d-0050568a2af3\")" failed. No retries permitted until 2018-11-16 14:13:36.325772394 +0800 CST m=+15413.723070969 (durationBeforeRetry 2m2s). Error: "error cleaning subPath mounts for volume \"pvc-66a1068f-e22d-11e8-a331-0050568a2af3\" (UniqueName: \"kubernetes.io/nfs/dc5262e0-e710-11e8-989d-0050568a2af3-pvc-66a1068f-e22d-11e8-a331-0050568a2af3\") pod \"dc5262e0-e710-11e8-989d-0050568a2af3\" (UID: \"dc5262e0-e710-11e8-989d-0050568a2af3\") : error reading /var/lib/kubelet/pods/dc5262e0-e710-11e8-989d-0050568a2af3/volume-subpaths/pvc-66a1068f-e22d-11e8-a331-0050568a2af3/org1-1000: lstat /var/lib/kubelet/pods/dc5262e0-e710-11e8-989d-0050568a2af3/volume-subpaths/pvc-66a1068f-e22d-11e8-a331-0050568a2af3/org1-1000/1: stale NFS file handle"
...

II. Root Cause

At first the log meant little to me. Googling turned up community reports claiming it is a Kubernetes bug. Even so, one idea kept circling in my head: the resource objects themselves have an implicit dependency. Put simply, for a given containerized application, the Pods created by a deployment depend on a persistentvolumeclaim. If, when deleting the application, the persistentvolumeclaim is cleaned up first and the deployment that depends on it is deleted afterwards, Kubernetes can run into exactly the problem above while reclaiming resources, because the dependency is already gone. After some further investigation this turned out to be precisely the root cause.
So the root cause can be summed up in one sentence: when Kubernetes actually performs the deletion of a resource object, it finds that the object depends on another resource object and needs to unbind the two; unfortunately the object it depends on has already been deleted, and since unbinding may also require resource cleanup that can never succeed against a missing resource, the target object can never be deleted either.

To make this dependency easier to understand, here is the corresponding YAML. Assuming nginx takes a while to terminate, deleting everything without controlling the order reproduces the same situation:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: nginx
  namespace: default
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx:1.15.3
        imagePullPolicy: IfNotPresent
        name: nginx
        ports:
        - containerPort: 80
          protocol: TCP
        volumeMounts:
        - mountPath: /data/
          name: data
      restartPolicy: Always
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: nginx-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-pvc
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1G
  storageClassName: managed-nfs-storage

III. Summary

  1. Taking the YAML above as an example: when deleting, first delete the deployment, watch until the deployment and its Pod replicas are completely gone, and only then delete the persistentvolumeclaim (see the kubectl sketch after this list). If everything is submitted to Kubernetes at once and the deployment and its Pods take a while to terminate, the persistentvolumeclaim finishes deleting before the deployment does, and the problem appears.
  2. For deployments and jobs, watching the deletion status of the object itself is enough; in my tests, once the object is gone its Pod replicas are gone too. Statefulsets are different: you must also watch the deletion status of the Pod replicas, because in my tests the statefulset itself disappears while its Pods linger for a while before they are fully deleted. The Pods' deletion status therefore has to be checked separately; keep this in mind when writing the code.
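
For the manual/kubectl path, the same ordering can be expressed with kubectl wait (available in reasonably recent kubectl releases). A sketch based on the example YAML above:

# delete the Deployment first and wait until its Pods are really gone
kubectl delete deployment nginx -n default
kubectl wait --for=delete pod -l app=nginx -n default --timeout=120s
# only then remove the PVC it depends on
kubectl delete pvc nginx-pvc -n default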

IV. Code Example: Deleting a Resource Object (Deployment) While Watching Its Deletion

type Metadata struct {
Kind string `json:"kind"`
Name string `json:"name"`
Namespace string `json:"namespace"`
}

metadataDepList := make([]module.Metadata, 0)

for i := 0; i < len(metadataList); i++ {
metadata := metadataList[i]
glog.Info("metadata: ", metadata)
if metadata.Kind == ResourceKindDeployment {
metadataDepList = append(metadataDepList, metadata)
}
}

for i := 0; i < len(metadataDepList); i++ {
metadataDep := metadataDepList[i]
dep, err := clientset.AppsV1beta1().Deployments(metadataDep.Namespace).Get(metadataDep.Name, v1.GetOptions{})
if err != nil {
return err
}
labels := dep.Labels

sign := make(chan error, 1)
go func() {
selector := pkglabels.FormatLabels(labels)
opts := v1.ListOptions{
LabelSelector: selector,
}
depWatch, err := clientset.AppsV1beta1().Deployments(metadataDep.Namespace).Watch(opts)
if err != nil {
sign <- err
return
}

deleteDepLoop:
for {
select {
case data := <-depWatch.ResultChan():
glog.Infof("Deleted deployment.Name: %s. deployment.Namespace: %s. data.Type: %s", dep.Name, dep.Namespace, data.Type)
sign <- nil
if data.Type == watch.Deleted {
glog.Infof("Deleted deployment.Name: %s. deployment.Namespace: %s.", dep.Name, dep.Namespace)
break deleteDepLoop
}
case <-time.After(time.Duration(30) * time.Second):
sign <- errors.New("delete deployment timeout. ")
depWatch.Stop()
return
}
}
}()
err = <- sign
if err != nil {
return err
}

err = clientset.AppsV1beta1().Deployments(metadataDep.Namespace).Delete(metadataDep.Name, &v1.DeleteOptions{
PropagationPolicy: &deletePropagationBackground,
})
if err != nil {
return err
}
}

Tracing Calico Network Communication (IPIP CrossSubnet Mode)

I. Environment Versions

Kubernetes v1.11.0
Calico v3.1.3
Calicoctl v3.1.3

II. Core Concepts Demystified

1. BGP is a router-to-router protocol that runs on top of TCP. Routers use BGP to exchange their routing information with each other.

Router A <----> BGP <----> Router B <----> BGP <----> Router C

2. A host runs many Pods; how are their IP addresses and their traffic handled?

By turning the host itself into a router. Once the host acts as a router, Pods reach each other the same way real networks do; the technology is all readily available, and it is the same technology that already connects the entire Internet.
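
On any node this "host as a router" view can be inspected directly: calicoctl shows the BGP sessions, and the kernel routing table shows the routes learned from BIRD (Calico's BGP daemon), as also seen later in this post:

calicoctl node status          # BGP peer status on this node
ip route | grep bird           # routes injected by BIRD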

III. Tracing the Communication Path

1. Deploy a set of Pods in the default namespace as a DaemonSet, to be used for testing cross-host Pod communication:

# kubectl get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
server01 Ready master 128d v1.11.0 172.16.170.128 <none> CentOS Linux 7 (Core) 3.10.0-862.11.6.el7.x86_64 docker://17.3.1
server02 Ready <none> 128d v1.11.0 172.16.170.129 <none> CentOS Linux 7 (Core) 3.10.0-514.el7.x86_64 docker://17.3.1
server03 Ready <none> 128d v1.11.0 172.16.170.130 <none> CentOS Linux 7 (Core) 3.10.0-514.el7.x86_64 docker://17.3.1

# kubectl get pod --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default network-5z4qp 1/1 Running 0 12s 10.211.0.6 server01
default network-6k87z 1/1 Running 0 12s 10.211.2.4 server03
default network-mngxw 1/1 Running 0 12s 10.211.1.4 server02
kube-system calico-node-644wq 2/2 Running 0 128d 172.16.170.128 server01
kube-system calico-node-hdkf6 2/2 Running 0 128d 172.16.170.130 server03
kube-system calico-node-wltgp 2/2 Running 0 128d 172.16.170.129 server02
kube-system coredns-777d78ff6f-6sjt5 1/1 Running 0 128d 10.211.0.3 server01
kube-system coredns-777d78ff6f-mr977 1/1 Running 0 128d 10.211.0.2 server01
kube-system etcd-server01 1/1 Running 0 128d 172.16.170.128 server01
kube-system kube-apiserver-server01 1/1 Running 0 128d 172.16.170.128 server01
kube-system kube-controller-manager-server01 1/1 Running 0 128d 172.16.170.128 server01
kube-system kube-proxy-94k52 1/1 Running 0 128d 172.16.170.129 server02
kube-system kube-proxy-czg29 1/1 Running 0 128d 172.16.170.128 server01
kube-system kube-proxy-mnhrb 1/1 Running 0 128d 172.16.170.130 server03
kube-system kube-scheduler-server01 1/1 Running 0 128d 172.16.170.128 server01

2. Pick the calico-node Pod on the master node and install the calicoctl command-line tool into it:

# curl -O -L https://github.com/projectcalico/calicoctl/releases/download/v3.1.3/calicoctl
# kubectl cp calicoctl calico-node-644wq:/usr/local/bin/ -n kube-system -c calico-node
# kubectl exec calico-node-644wq -n kube-system -c calico-node -- chmod 0755 /usr/local/bin/calicoctl

3. Enter the calico-node Pod on the master node and use calicoctl to inspect the workload endpoints we care about:

# kubectl exec -it calico-node-644wq /bin/sh -n kube-system -c calico-node
/ # calicoctl get workloadendpoint -n default
NAMESPACE WORKLOAD NODE NETWORKS INTERFACE
default network-5z4qp server01 10.211.0.6/32 cali50914021272
default network-6k87z server03 10.211.2.4/32 calif17d1193010
default network-mngxw server02 10.211.1.4/32 calid406b8b6c93

/ # calicoctl get workloadendpoint -n default -o yaml
apiVersion: projectcalico.org/v3
items:
- apiVersion: projectcalico.org/v3
kind: WorkloadEndpoint
metadata:
creationTimestamp: 2019-01-16T06:46:19Z
labels:
app: network
controller-revision-hash: "66086030"
pod-template-generation: "1"
projectcalico.org/namespace: default
projectcalico.org/orchestrator: k8s
name: server01-k8s-network--5z4qp-eth0
namespace: default
resourceVersion: "25227"
uid: 6df53e81-195a-11e9-b5d1-000c2921eba0
spec:
endpoint: eth0
interfaceName: cali50914021272
ipNetworks:
- 10.211.0.6/32
node: server01
orchestrator: k8s
pod: network-5z4qp
profiles:
- kns.default
- apiVersion: projectcalico.org/v3
kind: WorkloadEndpoint
metadata:
creationTimestamp: 2019-01-16T06:46:19Z
labels:
app: network
controller-revision-hash: "66086030"
pod-template-generation: "1"
projectcalico.org/namespace: default
projectcalico.org/orchestrator: k8s
name: server03-k8s-network--6k87z-eth0
namespace: default
resourceVersion: "25084"
uid: 6dfe39ec-195a-11e9-b5d1-000c2921eba0
spec:
endpoint: eth0
interfaceName: calif17d1193010
ipNetworks:
- 10.211.2.4/32
node: server03
orchestrator: k8s
pod: network-6k87z
profiles:
- kns.default
- apiVersion: projectcalico.org/v3
kind: WorkloadEndpoint
metadata:
creationTimestamp: 2019-01-16T06:46:19Z
labels:
app: network
controller-revision-hash: "66086030"
pod-template-generation: "1"
projectcalico.org/namespace: default
projectcalico.org/orchestrator: k8s
name: server02-k8s-network--mngxw-eth0
namespace: default
resourceVersion: "25082"
uid: 6e0213f6-195a-11e9-b5d1-000c2921eba0
spec:
endpoint: eth0
interfaceName: calid406b8b6c93
ipNetworks:
- 10.211.1.4/32
node: server02
orchestrator: k8s
pod: network-mngxw
profiles:
- kns.default
kind: WorkloadEndpointList
metadata: {}

/ # calicoctl get workloadendpoint server01-k8s-network--5z4qp-eth0 -n default -o yaml
apiVersion: projectcalico.org/v3
kind: WorkloadEndpoint
metadata:
creationTimestamp: 2019-01-16T06:46:19Z
labels:
app: network
controller-revision-hash: "66086030"
pod-template-generation: "1"
projectcalico.org/namespace: default
projectcalico.org/orchestrator: k8s
name: server01-k8s-network--5z4qp-eth0
namespace: default
resourceVersion: "25227"
uid: 6df53e81-195a-11e9-b5d1-000c2921eba0
spec:
endpoint: eth0
interfaceName: cali50914021272
ipNetworks:
- 10.211.0.6/32
node: server01
orchestrator: k8s
pod: network-5z4qp
profiles:
- kns.default

/ # exit

4. Enter the test Pod on the master node and inspect its network configuration:

# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 4e:30:25:dc:21:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
/ # ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 4e:30:25:dc:21:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.211.0.6/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::4c30:25ff:fedc:2188/64 scope link
valid_lft forever preferred_lft forever
/ # ip neigh
/ # ping -c 3 10.211.0.1
PING 10.211.0.1 (10.211.0.1) 56(84) bytes of data.
64 bytes from 10.211.0.1: icmp_seq=1 ttl=64 time=0.085 ms
64 bytes from 10.211.0.1: icmp_seq=2 ttl=64 time=0.072 ms
64 bytes from 10.211.0.1: icmp_seq=3 ttl=64 time=0.069 ms

--- 10.211.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.069/0.075/0.085/0.009 ms
/ # ip neigh
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee REACHABLE
/ # exit

5. Back on the master node, look at the device cali50914021272 referenced in the workload endpoint YAML above, and at the addresses of all devices:

# ip link show cali50914021272
11: cali50914021272@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2

# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:21:eb:a0 brd ff:ff:ff:ff:ff:ff
inet 172.16.170.128/24 brd 172.16.170.255 scope global ens33
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fe21:eba0/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:ca:a4:09:a5 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:caff:fea4:9a5/64 scope link
valid_lft forever preferred_lft forever
4: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
inet 10.211.0.1/32 brd 10.211.0.1 scope global tunl0
valid_lft forever preferred_lft forever
5: cali009d9b46eef@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
6: calib99e709bd2c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
11: cali50914021272@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever

6. Verify that packets from the Pod on the master can reach the master itself (10.211.0.1 is the address of the tunl0 device):

# run in a new terminal
# tcpdump -i cali50914021272 icmp -v
tcpdump: listening on cali50914021272, link-type EN10MB (Ethernet), capture size 262144 bytes
02:00:24.930193 IP (tos 0x0, ttl 64, id 51910, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > server01: ICMP echo request, id 18, seq 1, length 64
02:00:24.930235 IP (tos 0x0, ttl 64, id 22594, offset 0, flags [none], proto ICMP (1), length 84)
server01 > 10.211.0.6: ICMP echo reply, id 18, seq 1, length 64

# run in the original terminal
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.0.1
PING 10.211.0.1 (10.211.0.1) 56(84) bytes of data.
64 bytes from 10.211.0.1: icmp_seq=1 ttl=64 time=0.081 ms

--- 10.211.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.081/0.081/0.081/0.000 ms
/ # exit

7. Check the routing table on the master to find the next hop for the target Pod (10.211.1.4) we are about to test:

# ip route
...
blackhole 10.211.0.0/24 proto bird
10.211.0.2 dev cali009d9b46eef scope link
10.211.0.3 dev calib99e709bd2c scope link
10.211.0.6 dev cali50914021272 scope link
...
...
10.211.1.0/24 via 172.16.170.129 dev ens33 proto bird
...

8. Verify that packets destined for the target Pod (10.211.1.4) leave the master via its physical NIC ens33:

# run in a new terminal
# tcpdump -i ens33 icmp -v
tcpdump: listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
02:30:22.537037 IP (tos 0x0, ttl 63, id 23441, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > 10.211.1.4: ICMP echo request, id 23, seq 1, length 64
02:30:22.537476 IP (tos 0x0, ttl 63, id 62650, offset 0, flags [none], proto ICMP (1), length 84)
10.211.1.4 > 10.211.0.6: ICMP echo reply, id 23, seq 1, length 64

# run in the original terminal
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.526 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.526/0.526/0.526/0.000 ms
/ # exit

9. Verify that packets destined for the target Pod (10.211.1.4) arrive at the ens33 NIC of its Node (172.16.170.129):

# on the Node that hosts the Pod 10.211.1.4
# tcpdump -i ens33 icmp -v
tcpdump: listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
02:33:17.808254 IP (tos 0x0, ttl 63, id 61238, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > 10.211.1.4: ICMP echo request, id 33, seq 1, length 64
02:33:17.808415 IP (tos 0x0, ttl 63, id 1878, offset 0, flags [none], proto ICMP (1), length 84)
10.211.1.4 > 10.211.0.6: ICMP echo reply, id 33, seq 1, length 64

# back on the Node that hosts the Pod 10.211.0.6
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.556 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.556/0.556/0.556/0.000 ms
/ # exit

10. On the Node 172.16.170.129, find the next hop for the target Pod (10.211.1.4):

# ip route
...
10.211.1.4 dev calid406b8b6c93 scope link
...

11. Verify that packets destined for the target Pod (10.211.1.4) arrive at the calid406b8b6c93 device on the Node (172.16.170.129):

# on the Node that hosts the Pod 10.211.1.4
# ip link show calid406b8b6c93
7: calid406b8b6c93@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0

# tcpdump -i calid406b8b6c93 icmp -v
tcpdump: listening on calid406b8b6c93, link-type EN10MB (Ethernet), capture size 262144 bytes
02:39:15.303283 IP (tos 0x0, ttl 62, id 44511, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > 10.211.1.4: ICMP echo request, id 38, seq 1, length 64
02:39:15.303392 IP (tos 0x0, ttl 64, id 47955, offset 0, flags [none], proto ICMP (1), length 84)
10.211.1.4 > 10.211.0.6: ICMP echo reply, id 38, seq 1, length 64


# back on the Node that hosts the Pod 10.211.0.6 (i.e. the master); open a new terminal
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.602 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.602/0.602/0.602/0.000 ms
/ # exit

12. Verify that packets destined for the target Pod (10.211.1.4) arrive at its own eth0 device:

# kubectl exec -it network-mngxw /bin/sh -n default
/ # ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether e2:89:27:9e:6f:a6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
/ # ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether e2:89:27:9e:6f:a6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.211.1.4/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::e089:27ff:fe9e:6fa6/64 scope link
valid_lft forever preferred_lft forever
/ # tcpdump -i eth0 icmp -v
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:42:45.870953 IP (tos 0x0, ttl 62, id 13540, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > network-mngxw: ICMP echo request, id 43, seq 1, length 64
07:42:45.871065 IP (tos 0x0, ttl 64, id 19616, offset 0, flags [none], proto ICMP (1), length 84)
network-mngxw > 10.211.0.6: ICMP echo reply, id 43, seq 1, length 64

# back on the Node that hosts the Pod 10.211.0.6 (i.e. the master)
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.602 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.602/0.602/0.602/0.000 ms
/ # exit

IV. References

https://docs.projectcalico.org/v3.1/usage/calicoctl/install
https://mp.weixin.qq.com/s/MZIj_cvvtTiAfNf_0lpfTg
https://mp.weixin.qq.com/s/oKxsWDTvoLeOSHAuPIxnGw
https://blog.csdn.net/ccy19910925/article/details/82424275

How to Clean Up Mount Points Created by the Kubelet?

Sometimes a Kubernetes node fails and has to be removed from the cluster, repaired, and then rejoined. Cleaning up the kubelet's mount points is an important step in that repair process. I noticed that kubeadm's reset subcommand performs this cleanup, so I dug into the kubeadm source, found how it is implemented, and record it briefly here.

The commands to clean up the kubelet's mount points are as follows:

# Stop the Kubernetes node
systemctl stop kubelet.service && systemctl stop docker.service

# Unmount everything under /var/lib/kubelet
awk '$2 ~ path {print $2}' path=/var/lib/kubelet /proc/mounts | xargs -r umount

# Bring the node back up
systemctl start docker.service && systemctl start kubelet.service
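
After the umount step it is worth confirming that nothing under /var/lib/kubelet is still mounted before starting the services again; reusing the same awk filter:

awk '$2 ~ path {print $2}' path=/var/lib/kubelet /proc/mounts
# no output means all kubelet mount points are gone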

References:
https://github.com/kubernetes/kubernetes/blob/v1.11.0/cmd/kubeadm/app/cmd/reset.go