Tracing Calico's Network Communication Path (IPIP CrossSubnet Mode)

I. Environment and Versions

Kubernetes v1.11.0
Calico v3.1.3
Calicoctl v3.1.3

II. Core Concepts Explained

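1. BGP is a protocol that routers use to talk to each other, and it runs on top of TCP. Through BGP, routers exchange their routing information with one another.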

Router A <----> BGP <----> Router B <----> BGP <----> Router C
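
The exchange in the diagram can be sketched as a toy path-vector flood over the chain A, B, C. This is only an illustration of "routers tell their neighbours which prefixes they can reach", not a BGP implementation: there are no TCP sessions, attributes, or policies. The function name and the assignment of Pod blocks to A, B and C are assumptions made for the example; 10.211.2.0/24 in particular is inferred from a Pod address and does not appear in any routing table in this article.

```python
# Toy path-vector exchange between three routers in a chain (A - B - C).
# NOT a BGP implementation: no TCP sessions, no attributes, no policy.

def exchange_routes(links, originated):
    """Flood advertisements until no router learns anything new.

    links:      dict router -> list of directly connected neighbours
    originated: dict router -> list of prefixes the router announces itself
    Returns dict router -> {prefix: path}; path lists the routers to cross,
    first element is the next hop, an empty path means "local".
    """
    tables = {r: {p: [] for p in originated.get(r, [])} for r in links}
    changed = True
    while changed:
        changed = False
        for router, neighbours in links.items():
            for peer in neighbours:
                for prefix, path in list(tables[router].items()):
                    if peer in path:
                        continue  # loop prevention: peer is already on the path
                    candidate = [router] + path
                    known = tables[peer].get(prefix)
                    if known is None or len(candidate) < len(known):
                        tables[peer][prefix] = candidate
                        changed = True
    return tables


# Hypothetical assignment of Pod blocks to the routers in the diagram.
LINKS = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
ORIGINATED = {
    "A": ["10.211.0.0/24"],
    "B": ["10.211.1.0/24"],
    "C": ["10.211.2.0/24"],  # assumed block, see lead-in
}

if __name__ == "__main__":
    for router, table in exchange_routes(LINKS, ORIGINATED).items():
        print(router, table)
```

After convergence every router knows a path to every prefix, even though A and C never talk to each other directly.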

2. A host runs many Pods. How should the IP addresses and traffic of those Pods be handled?

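The answer is to turn the host into a router. Once the host acts as a router, Pods are connected the same way real-world networks are connected. The technology already exists and already carries the Internet that spans the globe.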

III. Tracing the Network Communication Path

1. Deploy a set of Pods as a DaemonSet in the default namespace to test cross-host Pod networking:

# kubectl get node -o wide
NAME       STATUS    ROLES     AGE       VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION               CONTAINER-RUNTIME
server01   Ready     master    128d      v1.11.0   172.16.170.128   <none>        CentOS Linux 7 (Core)   3.10.0-862.11.6.el7.x86_64   docker://17.3.1
server02   Ready     <none>    128d      v1.11.0   172.16.170.129   <none>        CentOS Linux 7 (Core)   3.10.0-514.el7.x86_64        docker://17.3.1
server03   Ready     <none>    128d      v1.11.0   172.16.170.130   <none>        CentOS Linux 7 (Core)   3.10.0-514.el7.x86_64        docker://17.3.1

# kubectl get pod --all-namespaces -o wide
NAMESPACE     NAME                               READY     STATUS    RESTARTS   AGE       IP               NODE
default       network-5z4qp                      1/1       Running   0          12s       10.211.0.6       server01
default       network-6k87z                      1/1       Running   0          12s       10.211.2.4       server03
default       network-mngxw                      1/1       Running   0          12s       10.211.1.4       server02
kube-system   calico-node-644wq                  2/2       Running   0          128d      172.16.170.128   server01
kube-system   calico-node-hdkf6                  2/2       Running   0          128d      172.16.170.130   server03
kube-system   calico-node-wltgp                  2/2       Running   0          128d      172.16.170.129   server02
kube-system   coredns-777d78ff6f-6sjt5           1/1       Running   0          128d      10.211.0.3       server01
kube-system   coredns-777d78ff6f-mr977           1/1       Running   0          128d      10.211.0.2       server01
kube-system   etcd-server01                      1/1       Running   0          128d      172.16.170.128   server01
kube-system   kube-apiserver-server01            1/1       Running   0          128d      172.16.170.128   server01
kube-system   kube-controller-manager-server01   1/1       Running   0          128d      172.16.170.128   server01
kube-system   kube-proxy-94k52                   1/1       Running   0          128d      172.16.170.129   server02
kube-system   kube-proxy-czg29                   1/1       Running   0          128d      172.16.170.128   server01
kube-system   kube-proxy-mnhrb                   1/1       Running   0          128d      172.16.170.130   server03
kube-system   kube-scheduler-server01            1/1       Running   0          128d      172.16.170.128   server01

2. Pick the calico-node Pod on the master node and install the calicoctl command-line tool into it:

# curl -O -L https://github.com/projectcalico/calicoctl/releases/download/v3.1.3/calicoctl
# kubectl cp calicoctl calico-node-644wq:/usr/local/bin/ -n kube-system -c calico-node
# kubectl exec calico-node-644wq -n kube-system -c calico-node -- chmod 0755 /usr/local/bin/calicoctl

3. Enter the calico-node Pod on the master node and use calicoctl to inspect the workload endpoints we care about:

# kubectl exec -it calico-node-644wq /bin/sh -n kube-system -c calico-node
/ # calicoctl get workloadendpoint -n default
NAMESPACE   WORKLOAD        NODE       NETWORKS        INTERFACE
default     network-5z4qp   server01   10.211.0.6/32   cali50914021272
default     network-6k87z   server03   10.211.2.4/32   calif17d1193010
default     network-mngxw   server02   10.211.1.4/32   calid406b8b6c93

/ # calicoctl get workloadendpoint -n default -o yaml
apiVersion: projectcalico.org/v3
items:
- apiVersion: projectcalico.org/v3
  kind: WorkloadEndpoint
  metadata:
    creationTimestamp: 2019-01-16T06:46:19Z
    labels:
      app: network
      controller-revision-hash: "66086030"
      pod-template-generation: "1"
      projectcalico.org/namespace: default
      projectcalico.org/orchestrator: k8s
    name: server01-k8s-network--5z4qp-eth0
    namespace: default
    resourceVersion: "25227"
    uid: 6df53e81-195a-11e9-b5d1-000c2921eba0
  spec:
    endpoint: eth0
    interfaceName: cali50914021272
    ipNetworks:
    - 10.211.0.6/32
    node: server01
    orchestrator: k8s
    pod: network-5z4qp
    profiles:
    - kns.default
- apiVersion: projectcalico.org/v3
  kind: WorkloadEndpoint
  metadata:
    creationTimestamp: 2019-01-16T06:46:19Z
    labels:
      app: network
      controller-revision-hash: "66086030"
      pod-template-generation: "1"
      projectcalico.org/namespace: default
      projectcalico.org/orchestrator: k8s
    name: server03-k8s-network--6k87z-eth0
    namespace: default
    resourceVersion: "25084"
    uid: 6dfe39ec-195a-11e9-b5d1-000c2921eba0
  spec:
    endpoint: eth0
    interfaceName: calif17d1193010
    ipNetworks:
    - 10.211.2.4/32
    node: server03
    orchestrator: k8s
    pod: network-6k87z
    profiles:
    - kns.default
- apiVersion: projectcalico.org/v3
  kind: WorkloadEndpoint
  metadata:
    creationTimestamp: 2019-01-16T06:46:19Z
    labels:
      app: network
      controller-revision-hash: "66086030"
      pod-template-generation: "1"
      projectcalico.org/namespace: default
      projectcalico.org/orchestrator: k8s
    name: server02-k8s-network--mngxw-eth0
    namespace: default
    resourceVersion: "25082"
    uid: 6e0213f6-195a-11e9-b5d1-000c2921eba0
  spec:
    endpoint: eth0
    interfaceName: calid406b8b6c93
    ipNetworks:
    - 10.211.1.4/32
    node: server02
    orchestrator: k8s
    pod: network-mngxw
    profiles:
    - kns.default
kind: WorkloadEndpointList
metadata: {}

/ # calicoctl get workloadendpoint server01-k8s-network--5z4qp-eth0 -n default -o yaml
apiVersion: projectcalico.org/v3
kind: WorkloadEndpoint
metadata:
  creationTimestamp: 2019-01-16T06:46:19Z
  labels:
    app: network
    controller-revision-hash: "66086030"
    pod-template-generation: "1"
    projectcalico.org/namespace: default
    projectcalico.org/orchestrator: k8s
  name: server01-k8s-network--5z4qp-eth0
  namespace: default
  resourceVersion: "25227"
  uid: 6df53e81-195a-11e9-b5d1-000c2921eba0
spec:
  endpoint: eth0
  interfaceName: cali50914021272
  ipNetworks:
  - 10.211.0.6/32
  node: server01
  orchestrator: k8s
  pod: network-5z4qp
  profiles:
  - kns.default

/ # exit
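
The endpoint names in the output above appear to follow a fixed pattern: node, orchestrator, Pod name and endpoint joined with "-", with every "-" inside the Pod name doubled (network-5z4qp becomes network--5z4qp). A minimal sketch of that pattern follows. The helper name is hypothetical, and applying the doubling to all four segments is an assumption, since only the Pod name contains a dash in this cluster.

```python
def workload_endpoint_name(node, orchestrator, pod, endpoint):
    """Build a name shaped like the ones calicoctl printed above.

    Assumption: a '-' inside any segment is doubled so that the single '-'
    separators between segments stay unambiguous.
    """
    def escape(segment):
        return segment.replace("-", "--")

    return "-".join(escape(s) for s in (node, orchestrator, pod, endpoint))


if __name__ == "__main__":
    print(workload_endpoint_name("server01", "k8s", "network-5z4qp", "eth0"))
```

This is handy when you know the Pod and its node and want to query a single endpoint with `calicoctl get workloadendpoint <name>` instead of listing all of them.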

4. Pick the test Pod running on the master node, enter it, and inspect its network configuration:

# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 4e:30:25:dc:21:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
/ # ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 4e:30:25:dc:21:88 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.211.0.6/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::4c30:25ff:fedc:2188/64 scope link
valid_lft forever preferred_lft forever
/ # ip neigh
/ # ping -c 3 10.211.0.1
PING 10.211.0.1 (10.211.0.1) 56(84) bytes of data.
64 bytes from 10.211.0.1: icmp_seq=1 ttl=64 time=0.085 ms
64 bytes from 10.211.0.1: icmp_seq=2 ttl=64 time=0.072 ms
64 bytes from 10.211.0.1: icmp_seq=3 ttl=64 time=0.069 ms

--- 10.211.0.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.069/0.075/0.085/0.009 ms
/ # ip neigh
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee REACHABLE
/ # exit
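
The Pod's address is 10.211.0.6/32, so no other address is on-link from the Pod's point of view, and even a ping to 10.211.0.1 has to go through a gateway. That matches the neighbour table above, which after the ping holds only 169.254.1.1 with MAC ee:ee:ee:ee:ee:ee. The Pod's own routing table is not shown in this article, so treating 169.254.1.1 as its default gateway is an assumption. The sketch below only demonstrates the on-link check; the /24 case is hypothetical and included for contrast.

```python
import ipaddress


def is_on_link(interface_cidr, destination):
    """True if destination falls inside the network configured on the interface."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(destination) in iface.network


if __name__ == "__main__":
    # With the /32 from the Pod, the tunl0 address is NOT on-link.
    print(is_on_link("10.211.0.6/32", "10.211.0.1"))
    # Hypothetical /24 for contrast: the same destination would be on-link.
    print(is_on_link("10.211.0.6/24", "10.211.0.1"))
```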

5. Back on the master node, look up the device cali50914021272 named in the Calico workload endpoint YAML, then list the addresses of all devices:

# ip link show cali50914021272
11: cali50914021272@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2

# ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:21:eb:a0 brd ff:ff:ff:ff:ff:ff
inet 172.16.170.128/24 brd 172.16.170.255 scope global ens33
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fe21:eba0/64 scope link
valid_lft forever preferred_lft forever
3: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:ca:a4:09:a5 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:caff:fea4:9a5/64 scope link
valid_lft forever preferred_lft forever
4: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
inet 10.211.0.1/32 brd 10.211.0.1 scope global tunl0
valid_lft forever preferred_lft forever
5: cali009d9b46eef@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
6: calib99e709bd2c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
11: cali50914021272@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 2
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
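
The name@ifN notation ties the two ends of a veth pair together: inside the Pod, eth0 has index 4 and peer index 11 (eth0@if11), while on the host cali50914021272 has index 11 and peer index 4. Three host devices show @if4, so the peer index alone is ambiguous; the host-side index is what identifies the pair. A minimal sketch of that matching, with hypothetical helper names and trimmed copies of the output above as input:

```python
import re

# Matches lines such as "4: eth0@if11: <BROADCAST,...>".
LINK_RE = re.compile(r"^(\d+): ([^@:\s]+)@if(\d+):")


def parse_links(ip_link_output):
    """Return {name: (ifindex, peer_ifindex)} for interfaces shown as name@ifN."""
    links = {}
    for line in ip_link_output.splitlines():
        m = LINK_RE.match(line.strip())
        if m:
            links[m.group(2)] = (int(m.group(1)), int(m.group(3)))
    return links


def find_peer(pod_links, pod_if, host_links):
    """Host interfaces whose (index, peer index) mirror the Pod interface."""
    index, peer_index = pod_links[pod_if]
    return [name for name, (i, p) in host_links.items()
            if i == peer_index and p == index]


POD_OUTPUT = """\
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN
4: eth0@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
"""
HOST_OUTPUT = """\
5: cali009d9b46eef@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
6: calib99e709bd2c@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
11: cali50914021272@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
"""

if __name__ == "__main__":
    print(find_peer(parse_links(POD_OUTPUT), "eth0", parse_links(HOST_OUTPUT)))
```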

6. Verify that packets from the Pod on the master reach the master host itself (10.211.0.1 is the address of the tunl0 device):

# Run in a new terminal
# tcpdump -i cali50914021272 icmp -v
tcpdump: listening on cali50914021272, link-type EN10MB (Ethernet), capture size 262144 bytes
02:00:24.930193 IP (tos 0x0, ttl 64, id 51910, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > server01: ICMP echo request, id 18, seq 1, length 64
02:00:24.930235 IP (tos 0x0, ttl 64, id 22594, offset 0, flags [none], proto ICMP (1), length 84)
server01 > 10.211.0.6: ICMP echo reply, id 18, seq 1, length 64

# Run in the original terminal
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.0.1
PING 10.211.0.1 (10.211.0.1) 56(84) bytes of data.
64 bytes from 10.211.0.1: icmp_seq=1 ttl=64 time=0.081 ms

--- 10.211.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.081/0.081/0.081/0.000 ms
/ # exit

7. Inspect the routing table on the master to find the next hop for the Pod we are about to test against (10.211.1.4):

# ip route
...
blackhole 10.211.0.0/24 proto bird
10.211.0.2 dev cali009d9b46eef scope link
10.211.0.3 dev calib99e709bd2c scope link
10.211.0.6 dev cali50914021272 scope link
...
...
10.211.1.0/24 via 172.16.170.129 dev ens33 proto bird
...
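
How the kernel picks among these entries can be sketched as a longest-prefix match: the most specific route that contains the destination wins. The table below holds only the routes visible in the output above (the elided lines stay elided), and bare addresses such as 10.211.0.6 are written as /32. The function name and the tuple layout are choices made for this example.

```python
import ipaddress

# Routes copied from the `ip route` output above; elided entries are omitted.
MASTER_ROUTES = [
    ("10.211.0.0/24", "blackhole"),
    ("10.211.0.2/32", "dev cali009d9b46eef"),
    ("10.211.0.3/32", "dev calib99e709bd2c"),
    ("10.211.0.6/32", "dev cali50914021272"),
    ("10.211.1.0/24", "via 172.16.170.129 dev ens33"),
]


def lookup(routes, destination):
    """Return the action of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, action in routes:
        net = ipaddress.ip_network(prefix)
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, action)
    return best[1] if best else None


if __name__ == "__main__":
    for dst in ("10.211.1.4", "10.211.0.6", "10.211.0.99"):
        print(dst, "->", lookup(MASTER_ROUTES, dst))
```

The target Pod 10.211.1.4 resolves to the next hop 172.16.170.129 on ens33, a local Pod resolves to its own cali device, and an unassigned address in the local block such as 10.211.0.99 falls through to the blackhole route.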

8. Verify that packets destined for the target Pod (10.211.1.4) are sent out through the master's host NIC ens33:

# Run in a new terminal
# tcpdump -i ens33 icmp -v
tcpdump: listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
02:30:22.537037 IP (tos 0x0, ttl 63, id 23441, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > 10.211.1.4: ICMP echo request, id 23, seq 1, length 64
02:30:22.537476 IP (tos 0x0, ttl 63, id 62650, offset 0, flags [none], proto ICMP (1), length 84)
10.211.1.4 > 10.211.0.6: ICMP echo reply, id 23, seq 1, length 64

# Run in the original terminal
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.526 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.526/0.526/0.526/0.000 ms
/ # exit
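
The capture on ens33 shows plain ICMP between the two Pod addresses with no outer IPIP header, and the route in step 7 points at ens33 rather than tunl0. That is what CrossSubnet mode is expected to do when both nodes sit in the same subnet, here 172.16.170.0/24. A rough sketch of the decision follows; the function name is hypothetical, the mode names follow the IP pool `ipipMode` values, and the cross-subnet peer 172.16.171.10 is made up for contrast.

```python
import ipaddress


def needs_ipip(local_node_cidr, remote_node_ip, mode="CrossSubnet"):
    """Decide whether traffic to a remote node would be IPIP-encapsulated."""
    if mode == "Never":
        return False
    if mode == "Always":
        return True
    if mode == "CrossSubnet":
        local = ipaddress.ip_interface(local_node_cidr)
        # Encapsulate only when the peer is outside the local node's subnet.
        return ipaddress.ip_address(remote_node_ip) not in local.network
    raise ValueError("unknown mode: %r" % mode)


if __name__ == "__main__":
    # server01 -> server02: same /24, so no encapsulation is expected.
    print(needs_ipip("172.16.170.128/24", "172.16.170.129"))
    # Hypothetical peer in another subnet: encapsulation is expected.
    print(needs_ipip("172.16.170.128/24", "172.16.171.10"))
```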

9. Verify that packets destined for the target Pod (10.211.1.4) arrive on the NIC ens33 of the Node (172.16.170.129):

# On the Node hosting the Pod with address 10.211.1.4
# tcpdump -i ens33 icmp -v
tcpdump: listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
02:33:17.808254 IP (tos 0x0, ttl 63, id 61238, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > 10.211.1.4: ICMP echo request, id 33, seq 1, length 64
02:33:17.808415 IP (tos 0x0, ttl 63, id 1878, offset 0, flags [none], proto ICMP (1), length 84)
10.211.1.4 > 10.211.0.6: ICMP echo reply, id 33, seq 1, length 64

# Back on the Node hosting the Pod with address 10.211.0.6
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.556 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.556/0.556/0.556/0.000 ms
/ # exit

10. On the Node (172.16.170.129), find the next hop for packets destined for the target Pod (10.211.1.4):

# ip route
...
10.211.1.4 dev calid406b8b6c93 scope link
...

11. Verify that packets destined for the target Pod (10.211.1.4) reach the network device calid406b8b6c93 on the Node (172.16.170.129):

# On the Node hosting the Pod with address 10.211.1.4
# ip link show calid406b8b6c93
7: calid406b8b6c93@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 0

# tcpdump -i calid406b8b6c93 icmp -v
tcpdump: listening on calid406b8b6c93, link-type EN10MB (Ethernet), capture size 262144 bytes
02:39:15.303283 IP (tos 0x0, ttl 62, id 44511, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > 10.211.1.4: ICMP echo request, id 38, seq 1, length 64
02:39:15.303392 IP (tos 0x0, ttl 64, id 47955, offset 0, flags [none], proto ICMP (1), length 84)
10.211.1.4 > 10.211.0.6: ICMP echo reply, id 38, seq 1, length 64


# Back on the Node hosting the Pod with address 10.211.0.6 (the master), in a new terminal
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.602 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.602/0.602/0.602/0.000 ms
/ # exit
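
The TTL values in the captures show where the packet is actually routed. The echo request leaves the Pod with ttl 64, shows ttl 63 on ens33 of both hosts, and ttl 62 on calid406b8b6c93, so each host decrements it once, like a router would. The reply makes the mirror-image trip, which is why ping reports ttl=62. A small sketch of that bookkeeping, with a hypothetical function name and capture-point labels taken from the steps above:

```python
def trace_ttl(initial_ttl, path):
    """Compute the TTL seen at each capture point.

    path: ordered list of (capture_point, routed_before_capture).
    routed_before_capture is True when a host has forwarded the packet
    between the previous capture point and this one.
    """
    ttl = initial_ttl
    seen = []
    for point, routed in path:
        if routed:
            ttl -= 1
        seen.append((point, ttl))
    return seen


REQUEST_PATH = [
    ("server01 cali50914021272", False),  # straight off the veth, not yet routed
    ("server01 ens33", True),             # server01 forwarded the packet
    ("server02 ens33", False),            # arrives unchanged over the wire
    ("server02 calid406b8b6c93", True),   # server02 forwarded the packet
]

if __name__ == "__main__":
    for point, ttl in trace_ttl(64, REQUEST_PATH):
        print(point, ttl)
```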

12. Verify that packets destined for the target Pod (10.211.1.4) arrive on its network device eth0:

# kubectl exec -it network-mngxw /bin/sh -n default
/ # ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether e2:89:27:9e:6f:a6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
/ # ip address
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether e2:89:27:9e:6f:a6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.211.1.4/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::e089:27ff:fe9e:6fa6/64 scope link
valid_lft forever preferred_lft forever
/ # tcpdump -i eth0 icmp -v
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
07:42:45.870953 IP (tos 0x0, ttl 62, id 13540, offset 0, flags [DF], proto ICMP (1), length 84)
10.211.0.6 > network-mngxw: ICMP echo request, id 43, seq 1, length 64
07:42:45.871065 IP (tos 0x0, ttl 64, id 19616, offset 0, flags [none], proto ICMP (1), length 84)
network-mngxw > 10.211.0.6: ICMP echo reply, id 43, seq 1, length 64

# Back on the Node hosting the Pod with address 10.211.0.6 (the master)
# kubectl exec -it network-5z4qp /bin/sh -n default
/ # ping -c 1 10.211.1.4
PING 10.211.1.4 (10.211.1.4) 56(84) bytes of data.
64 bytes from 10.211.1.4: icmp_seq=1 ttl=62 time=0.602 ms

--- 10.211.1.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.602/0.602/0.602/0.000 ms
/ # exit
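
Putting the steps together, the request travels from the source Pod's eth0 to cali50914021272, out through ens33 on server01, in through ens33 on server02, and through calid406b8b6c93 to the target Pod's eth0. The host-side part of that path can be replayed by chaining route lookups across the two nodes. The sketch below uses only routes printed in steps 7 and 10; the data layout and function names are assumptions for illustration. The return route on server02 is elided in the article, so it is left out here as well.

```python
import ipaddress

# (prefix, outgoing device, next-hop node IP or None when directly attached)
ROUTES = {
    "server01": [
        ("10.211.0.6/32", "cali50914021272", None),
        ("10.211.1.0/24", "ens33", "172.16.170.129"),
    ],
    "server02": [
        ("10.211.1.4/32", "calid406b8b6c93", None),
    ],
}
# Node addresses from `kubectl get node -o wide`.
NODE_BY_IP = {"172.16.170.128": "server01", "172.16.170.129": "server02"}


def best_route(node, destination):
    """Longest-prefix match in one node's table, or None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [r for r in ROUTES[node] if dest in ipaddress.ip_network(r[0])]
    return max(matches,
               key=lambda r: ipaddress.ip_network(r[0]).prefixlen,
               default=None)


def trace(node, destination, max_hops=8):
    """Follow next hops node by node; returns [(node, outgoing device)]."""
    hops = []
    for _ in range(max_hops):
        route = best_route(node, destination)
        if route is None:
            hops.append((node, None))  # no route listed in this sketch
            break
        _, device, via = route
        hops.append((node, device))
        if via is None:
            break  # directly attached: the packet is handed to the Pod
        node = NODE_BY_IP[via]
    return hops


if __name__ == "__main__":
    print(trace("server01", "10.211.1.4"))
```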

IV. References

https://docs.projectcalico.org/v3.1/usage/calicoctl/install
https://mp.weixin.qq.com/s/MZIj_cvvtTiAfNf_0lpfTg
https://mp.weixin.qq.com/s/oKxsWDTvoLeOSHAuPIxnGw
https://blog.csdn.net/ccy19910925/article/details/82424275