Notes on a Prometheus Federation Experiment

I. Environment

1. Versions

Docker Engine Community 19.03.1
Prometheus 2.11.1
Node exporter 0.18.1
Nginx 1.16.1

2. Servers

192.168.112.131 server04
192.168.112.132 server05
192.168.112.133 server06

II. Key Points

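As we know, an Exporter serves its metrics over HTTP, and a Prometheus instance collects samples by periodically scraping that HTTP endpoint. Federation works the same way: every Prometheus instance exposes a Federation endpoint over HTTP (/federate), and a top-level Prometheus instance collects samples by periodically scraping the Federation endpoint of the Prometheus instances directly below it.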
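
To make the Federation request concrete, here is a minimal sketch of how the URL for one such scrape could be assembled by hand. The helper names urlencode and federate_url are hypothetical, the encoder only handles ASCII input, and the host, port, and selector are the ones used later in this post.

```shell
#!/bin/sh
# Sketch only: urlencode and federate_url are illustrative helpers, not Prometheus tools.

# Percent-encode an ASCII string for use in a URL query component.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character itself
        s=$rest
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

# federate_url HOST:PORT SELECTOR
# Prints the /federate URL with the selector passed as an encoded match[] parameter.
federate_url() {
    printf 'https://%s/federate?match%%5B%%5D=%s\n' "$1" "$(urlencode "$2")"
}

# Usage (prompts for the Basic auth password):
#   curl --cacert /opt/prometheus/pki/ca.crt -u admin \
#       "$(federate_url 192.168.112.132:19090 '{job=~"server.*"}')"
```

The brackets of match[] are percent-encoded as well so that curl does not treat them as a glob pattern.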

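Suppose you need to monitor across regions, server rooms, or even data centers, for example monitoring Nanjing, Chengdu, and Chongqing from Beijing. Leaving high availability aside, how would you apply the Exporter and Federation mechanisms described above?
Option 1: deploy one Prometheus instance in Beijing and one Exporter in each of Nanjing, Chengdu, and Chongqing, expose the Exporter ports on the public network, and have Prometheus periodically scrape them through their public IPs and ports.
Option 2: deploy one Prometheus instance in Beijing and one Prometheus instance in each of Nanjing, Chengdu, and Chongqing, expose their ports on the public network, and have the Beijing instance periodically scrape the Federation endpoints in Nanjing, Chengdu, and Chongqing through their public IPs and ports.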

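The problem is that both options are far too insecure as they stand, for two reasons:
First, the HTTP requests are not authenticated, so every request is accepted unconditionally.
Second, plain HTTP sends everything in cleartext, with no encryption.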

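To use this over the public network, we need some basic hardening: Basic auth and TLS encryption. According to the official Prometheus documentation, Prometheus (as of the 2.11.1 release used here) does not provide either feature itself and relies on an HTTP reverse proxy placed in front of it; the official guides use Nginx. See the references at the end of this post for the links.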
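
The two measures belong together, because Basic auth on its own only base64-encodes the credentials. A minimal sketch of the header a client sends, assuming the base64 utility is available (the helper name basic_auth_header is hypothetical; admin and 12345678 are the credentials used later in this post):

```shell
#!/bin/sh
# basic_auth_header USER PASSWORD
# Prints the Authorization header that HTTP Basic auth attaches to every request.
basic_auth_header() {
    printf 'Authorization: Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

# Usage:
#   basic_auth_header admin 12345678
```

Anyone who can read the traffic can reverse the encoded value with a base64 decoder, which is why the Nginx listener in the following steps only accepts TLS connections.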

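To keep the demo environment easy to set up, I run Prometheus and Node Exporter in containers and install Nginx as a binary. The same approach applies to binary installations of Prometheus and Node Exporter, so I won't repeat it here.

The rest of this post only demonstrates how to use Prometheus Federation across regions, server rooms, or data centers. Having Prometheus scrape Exporters over the public network works the same way; the key is how Nginx is used.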

III. Procedure

1. Pull the Prometheus-related Docker images

docker pull prom/prometheus:v2.11.1
docker pull prom/node-exporter:v0.18.1

2. Pull the Nginx Docker image (for running Nginx in a container; skip this step if you run Nginx as a binary)

docker pull nginx:1.16.1

3. Install Nginx with yum on each of the three servers (for running Nginx as a binary; skip this step if you run Nginx in a container)

yum install -y epel-release
yum makecache fast
yum install -y nginx
yum install -y httpd-tools

4. Generate the certificates on server04, then copy all certificate files to the same directory on server05 and server06

mkdir -p /opt/prometheus/pki/
cd /opt/prometheus/pki/
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -subj "/CN=prometheus" -days 50000 -out ca.crt

cat <<EOF > ssl.cnf
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name

[req_distinguished_name]
[v3_req]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation,digitalSignature,keyEncipherment
subjectAltName = @alt_names
[alt_names]
DNS.1 = server04
DNS.2 = server05
DNS.3 = server06
DNS.4 = localhost
IP.1 = 192.168.112.131
IP.2 = 192.168.112.132
IP.3 = 192.168.112.133
IP.4 = 127.0.0.1
EOF

openssl genrsa -out server.key 2048
openssl req -new -key server.key -subj "/CN=prometheus-server" -config ssl.cnf -out server.csr
openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreateserial -days 50000 -extensions v3_req -extfile ssl.cnf -out server.crt

openssl genrsa -out client.key 2048
openssl req -new -key client.key -subj "/CN=prometheus-client" -out client.csr
openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out client.crt -days 50000

# The distribution step is omitted here
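
Before distributing the files, it can be worth checking that the signed certificates chain to the CA and that the server certificate carries the expected subjectAltName entries. A minimal sketch, assuming openssl and grep are available; the helper name check_cert is hypothetical:

```shell
#!/bin/sh
# check_cert CA_FILE CERT_FILE
# Verifies CERT_FILE against CA_FILE, then prints its subjectAltName entries.
# Returns non-zero only when the certificate does not chain to the CA.
check_cert() {
    openssl verify -CAfile "$1" "$2" || return 1
    # The SAN listing is informational; a certificate without SANs
    # (such as client.crt above) simply prints nothing here.
    openssl x509 -in "$2" -noout -text | grep -A1 'Subject Alternative Name' || true
}

# Usage:
#   cd /opt/prometheus/pki/
#   check_cert ca.crt server.crt
#   check_cert ca.crt client.crt
```

For server.crt the listing should include the three server names and IP addresses from ssl.cnf, since the clients connect by IP address.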

5. Start an Nginx instance on each of the three servers (for running Nginx as a binary; skip this step if you run Nginx in a container)

# Run on server04
# Create the Basic auth user "admin" (htpasswd prompts for the password)
htpasswd -c /etc/nginx/.htpasswd admin

## Edit the main Nginx configuration file
http {
    # ...... modified line ......
    include /etc/nginx/conf.d/http_*.conf;
}
# ...... added block (top level, outside http) ......
stream {
    include /etc/nginx/conf.d/stream_*.conf;
}

# "$" is escaped below so that the shell does not expand the Nginx variables inside the heredoc
cat <<EOF > /etc/nginx/conf.d/http_prometheus.conf
upstream prometheus {
    server 192.168.112.131:9090;
}

server {
    listen 19090 ssl;
    server_name 192.168.112.131;
    ssl_certificate /opt/prometheus/pki/server.crt;
    ssl_certificate_key /opt/prometheus/pki/server.key;
    access_log /var/log/nginx/prometheus-access.log main;
    error_log /var/log/nginx/prometheus-error.log;
    add_header Cache-Control no-cache;

    location / {
        auth_basic "prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://prometheus/;
        proxy_http_version 1.1;
        proxy_connect_timeout 30m;
        proxy_send_timeout 30m;
        proxy_read_timeout 30m;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection \$http_connection;
        proxy_buffering off;
    }
}
EOF

systemctl enable nginx.service
systemctl start nginx.service
systemctl status nginx.service

# Run on server05
# Create the Basic auth user "admin" (htpasswd prompts for the password)
htpasswd -c /etc/nginx/.htpasswd admin

## Edit the main Nginx configuration file
http {
    # ...... modified line ......
    include /etc/nginx/conf.d/http_*.conf;
}
# ...... added block (top level, outside http) ......
stream {
    include /etc/nginx/conf.d/stream_*.conf;
}

# "$" is escaped below so that the shell does not expand the Nginx variables inside the heredoc
cat <<EOF > /etc/nginx/conf.d/http_prometheus.conf
upstream prometheus {
    server 192.168.112.132:9090;
}

server {
    listen 19090 ssl;
    server_name 192.168.112.132;
    ssl_certificate /opt/prometheus/pki/server.crt;
    ssl_certificate_key /opt/prometheus/pki/server.key;
    access_log /var/log/nginx/prometheus-access.log main;
    error_log /var/log/nginx/prometheus-error.log;
    add_header Cache-Control no-cache;

    location / {
        auth_basic "prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://prometheus/;
        proxy_http_version 1.1;
        proxy_connect_timeout 30m;
        proxy_send_timeout 30m;
        proxy_read_timeout 30m;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection \$http_connection;
        proxy_buffering off;
    }
}
EOF

systemctl enable nginx.service
systemctl start nginx.service
systemctl status nginx.service

# Run on server06
# Create the Basic auth user "admin" (htpasswd prompts for the password)
htpasswd -c /etc/nginx/.htpasswd admin

## Edit the main Nginx configuration file
http {
    # ...... modified line ......
    include /etc/nginx/conf.d/http_*.conf;
}
# ...... added block (top level, outside http) ......
stream {
    include /etc/nginx/conf.d/stream_*.conf;
}

# "$" is escaped below so that the shell does not expand the Nginx variables inside the heredoc
cat <<EOF > /etc/nginx/conf.d/http_prometheus.conf
upstream prometheus {
    server 192.168.112.133:9090;
}

server {
    listen 19090 ssl;
    server_name 192.168.112.133;
    ssl_certificate /opt/prometheus/pki/server.crt;
    ssl_certificate_key /opt/prometheus/pki/server.key;
    access_log /var/log/nginx/prometheus-access.log main;
    error_log /var/log/nginx/prometheus-error.log;
    add_header Cache-Control no-cache;

    location / {
        auth_basic "prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://prometheus/;
        proxy_http_version 1.1;
        proxy_connect_timeout 30m;
        proxy_send_timeout 30m;
        proxy_read_timeout 30m;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection \$http_connection;
        proxy_buffering off;
    }
}
EOF

systemctl enable nginx.service
systemctl start nginx.service
systemctl status nginx.service

6. Start an Nginx container on each of the three servers (for running Nginx in a container; skip this step if you run Nginx as a binary)

# Run on server04 (/etc/nginx/.htpasswd must already exist; create it with htpasswd as in step 5)
mkdir -p nginx/etc/nginx/conf.d/
cp -r /opt/prometheus/pki nginx/etc/nginx/
cp -r /etc/nginx/.htpasswd nginx/etc/nginx/

cat <<EOF > nginx/etc/nginx/nginx.conf
user nginx;
worker_processes 1;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '\$remote_addr - \$remote_user [\$time_local] "\$request" '
                    '\$status \$body_bytes_sent "\$http_referer" '
                    '"\$http_user_agent" "\$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 65;

    #gzip on;

    include /etc/nginx/conf.d/http_*.conf;
}

stream {
    include /etc/nginx/conf.d/stream_*.conf;
}
EOF

cat <<EOF > nginx/etc/nginx/conf.d/http_prom.conf
upstream prometheus {
    server 192.168.112.131:9090;
}

server {
    listen 19090 ssl;
    server_name 192.168.112.131;
    ssl_certificate /etc/nginx/pki/server.crt;
    ssl_certificate_key /etc/nginx/pki/server.key;
    add_header Cache-Control no-cache;

    location / {
        auth_basic "prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://prometheus/;
        proxy_http_version 1.1;
        proxy_connect_timeout 30m;
        proxy_send_timeout 30m;
        proxy_read_timeout 30m;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection \$http_connection;
        proxy_buffering off;
    }
}
EOF

cd nginx/

docker run -d -p 19090:19090 \
    -v "$(pwd)/etc/nginx/nginx.conf:/etc/nginx/nginx.conf" \
    -v "$(pwd)/etc/nginx/.htpasswd:/etc/nginx/.htpasswd" \
    -v "$(pwd)/etc/nginx/pki/:/etc/nginx/pki/" \
    -v "$(pwd)/etc/nginx/conf.d/:/etc/nginx/conf.d/" \
    -v /etc/localtime:/etc/localtime \
    --name nginx nginx:1.16.1

# Run on server05
mkdir -p nginx/etc/nginx/conf.d/
cp -r /opt/prometheus/pki nginx/etc/nginx/
cp -r /etc/nginx/.htpasswd nginx/etc/nginx/

cat <<EOF > nginx/etc/nginx/nginx.conf
user nginx;
worker_processes 1;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '\$remote_addr - \$remote_user [\$time_local] "\$request" '
                    '\$status \$body_bytes_sent "\$http_referer" '
                    '"\$http_user_agent" "\$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 65;

    #gzip on;

    include /etc/nginx/conf.d/http_*.conf;
}

stream {
    include /etc/nginx/conf.d/stream_*.conf;
}
EOF

cat <<EOF > nginx/etc/nginx/conf.d/http_prom.conf
upstream prometheus {
    server 192.168.112.132:9090;
}

server {
    listen 19090 ssl;
    server_name 192.168.112.132;
    ssl_certificate /etc/nginx/pki/server.crt;
    ssl_certificate_key /etc/nginx/pki/server.key;
    add_header Cache-Control no-cache;

    location / {
        auth_basic "prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://prometheus/;
        proxy_http_version 1.1;
        proxy_connect_timeout 30m;
        proxy_send_timeout 30m;
        proxy_read_timeout 30m;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection \$http_connection;
        proxy_buffering off;
    }
}
EOF

cd nginx/

docker run -d -p 19090:19090 \
    -v "$(pwd)/etc/nginx/nginx.conf:/etc/nginx/nginx.conf" \
    -v "$(pwd)/etc/nginx/.htpasswd:/etc/nginx/.htpasswd" \
    -v "$(pwd)/etc/nginx/pki/:/etc/nginx/pki/" \
    -v "$(pwd)/etc/nginx/conf.d/:/etc/nginx/conf.d/" \
    -v /etc/localtime:/etc/localtime \
    --name nginx nginx:1.16.1

# Run on server06
mkdir -p nginx/etc/nginx/conf.d/
cp -r /opt/prometheus/pki nginx/etc/nginx/
cp -r /etc/nginx/.htpasswd nginx/etc/nginx/

cat <<EOF > nginx/etc/nginx/nginx.conf
user nginx;
worker_processes 1;

error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 1024;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    log_format main '\$remote_addr - \$remote_user [\$time_local] "\$request" '
                    '\$status \$body_bytes_sent "\$http_referer" '
                    '"\$http_user_agent" "\$http_x_forwarded_for"';

    access_log /var/log/nginx/access.log main;

    sendfile on;
    #tcp_nopush on;

    keepalive_timeout 65;

    #gzip on;

    include /etc/nginx/conf.d/http_*.conf;
}

stream {
    include /etc/nginx/conf.d/stream_*.conf;
}
EOF

cat <<EOF > nginx/etc/nginx/conf.d/http_prom.conf
upstream prometheus {
    server 192.168.112.133:9090;
}

server {
    listen 19090 ssl;
    server_name 192.168.112.133;
    ssl_certificate /etc/nginx/pki/server.crt;
    ssl_certificate_key /etc/nginx/pki/server.key;
    add_header Cache-Control no-cache;

    location / {
        auth_basic "prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://prometheus/;
        proxy_http_version 1.1;
        proxy_connect_timeout 30m;
        proxy_send_timeout 30m;
        proxy_read_timeout 30m;
        proxy_set_header Upgrade \$http_upgrade;
        proxy_set_header Connection \$http_connection;
        proxy_buffering off;
    }
}
EOF

cd nginx/

docker run -d -p 19090:19090 \
    -v "$(pwd)/etc/nginx/nginx.conf:/etc/nginx/nginx.conf" \
    -v "$(pwd)/etc/nginx/.htpasswd:/etc/nginx/.htpasswd" \
    -v "$(pwd)/etc/nginx/pki/:/etc/nginx/pki/" \
    -v "$(pwd)/etc/nginx/conf.d/:/etc/nginx/conf.d/" \
    -v /etc/localtime:/etc/localtime \
    --name nginx nginx:1.16.1

7. Start a Prometheus instance on each of the three servers

# Run on server04
mkdir -p /opt/prometheus/rules/

cat <<EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Federates from the Prometheus on server05 (192.168.112.132) through its Nginx proxy
  - job_name: 'server04'
    scrape_interval: 15s
    basic_auth:
      username: admin
      password: 12345678  # the password set for admin with htpasswd
    tls_config:
      ca_file: /etc/prometheus/pki/ca.crt
      cert_file: /etc/prometheus/pki/client.crt
      key_file: /etc/prometheus/pki/client.key

    honor_labels: true
    metrics_path: '/federate'
    scheme: https
    params:
      'match[]':
        - '{job=~"server.*"}'

    static_configs:
      - targets:
          - '192.168.112.132:19090'
        labels:
          prometheus: 'server04'

  # Federates from the Prometheus on server06 (192.168.112.133) through its Nginx proxy
  - job_name: 'server05'
    scrape_interval: 15s
    basic_auth:
      username: admin
      password: 12345678  # the password set for admin with htpasswd
    tls_config:
      ca_file: /etc/prometheus/pki/ca.crt
      cert_file: /etc/prometheus/pki/client.crt
      key_file: /etc/prometheus/pki/client.key

    honor_labels: true
    metrics_path: '/federate'
    scheme: https
    params:
      'match[]':
        - '{job=~"server.*"}'

    static_configs:
      - targets:
          - '192.168.112.133:19090'
        labels:
          prometheus: 'server05'

rule_files:
  - "/opt/prometheus/rules/prometheus.yaml"
EOF

# "$" is escaped so that the shell leaves the alert template variable intact
cat <<EOF > /opt/prometheus/rules/prometheus.yaml
groups:
  - name: node.down
    rules:
      - alert: node:down
        expr: |
          up{instance="192.168.112.131",job="server04"} == 0
        for: 5s
        labels:
          severity: node
        annotations:
          summary: "Node down"
          description: "Node {{ \$labels.instance }} is down; please investigate promptly."
EOF

docker run -d --name=prometheus -p 9090:9090 \
    -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /opt/prometheus/rules/prometheus.yaml:/opt/prometheus/rules/prometheus.yaml \
    -v /opt/prometheus/pki/:/etc/prometheus/pki/ \
    -v /etc/localtime:/etc/localtime \
    prom/prometheus:v2.11.1 \
    --config.file=/etc/prometheus/prometheus.yml \
    --web.external-url="https://192.168.112.131:19090/" \
    --web.route-prefix="/"

# Run on server05
mkdir -p /opt/prometheus/rules/

cat <<EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: server05
    static_configs:
      - targets: ['192.168.112.132:9100']
        labels:
          instance: 192.168.112.132
          environment: dev

rule_files:
  - "/opt/prometheus/rules/prometheus.yaml"
EOF

# "$" is escaped so that the shell leaves the alert template variable intact
cat <<EOF > /opt/prometheus/rules/prometheus.yaml
groups:
  - name: node.down
    rules:
      - alert: node:down
        expr: |
          up{instance="192.168.112.132",job="server05"} == 0
        for: 5s
        labels:
          severity: node
        annotations:
          summary: "Node down"
          description: "Node {{ \$labels.instance }} is down; please investigate promptly."
EOF

docker run -d --name=prometheus -p 9090:9090 \
    -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /opt/prometheus/rules/prometheus.yaml:/opt/prometheus/rules/prometheus.yaml \
    -v /etc/localtime:/etc/localtime \
    prom/prometheus:v2.11.1 \
    --config.file=/etc/prometheus/prometheus.yml \
    --web.external-url="https://192.168.112.132:19090/" \
    --web.route-prefix="/"

# Run on server06
mkdir -p /opt/prometheus/rules/

cat <<EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: server06
    static_configs:
      - targets: ['192.168.112.133:9100']
        labels:
          instance: 192.168.112.133
          environment: dev

rule_files:
  - "/opt/prometheus/rules/prometheus.yaml"
EOF

# "$" is escaped so that the shell leaves the alert template variable intact
cat <<EOF > /opt/prometheus/rules/prometheus.yaml
groups:
  - name: node.down
    rules:
      - alert: node:down
        expr: |
          up{instance="192.168.112.133",job="server06"} == 0
        for: 5s
        labels:
          severity: node
        annotations:
          summary: "Node down"
          description: "Node {{ \$labels.instance }} is down; please investigate promptly."
EOF

docker run -d --name=prometheus -p 9090:9090 \
    -v /opt/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /opt/prometheus/rules/prometheus.yaml:/opt/prometheus/rules/prometheus.yaml \
    -v /etc/localtime:/etc/localtime \
    prom/prometheus:v2.11.1 \
    --config.file=/etc/prometheus/prometheus.yml \
    --web.external-url="https://192.168.112.133:19090/" \
    --web.route-prefix="/"

IV. Verification

Open https://192.168.112.131:19090/targets and enter the username and password; all targets should be shown as healthy, as in the screenshot below:
[Screenshot: targets page]
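
The same check can be scripted against the targets endpoint of the Prometheus HTTP API instead of the web page. A minimal sketch, assuming grep is available; the helper name count_unhealthy_targets is hypothetical, and the text matching is a stand-in for a proper JSON parser:

```shell
#!/bin/sh
# count_unhealthy_targets
# Reads a /api/v1/targets JSON response on stdin and prints how many
# target entries report a health value other than "up".
count_unhealthy_targets() {
    grep -o '"health": *"[a-z]*"' | grep -vc '"up"' || true
}

# Usage (prompts for the admin password):
#   curl -s --cacert /opt/prometheus/pki/ca.crt -u admin \
#       https://192.168.112.131:19090/api/v1/targets | count_unhealthy_targets
```

A result of 0 corresponds to every target on the page being healthy.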

V. References

1. Official documentation

https://prometheus.io/docs/guides/basic-auth/
https://prometheus.io/docs/guides/tls-encryption/
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#tls_config

2. Non-official resources

http://nginx.org/en/docs/http/configuring_https_servers.html
https://www.cnblogs.com/shenlinken/p/9968274.html
https://www.cnblogs.com/qiyueqi/p/11551238.html
https://my.oschina.net/sskxyz/blog/1554093?utm_source=debugrun&utm_medium=referral