Project links
Upstream: https://github.com/prometheus-operator/prometheus-operator
Helm chart: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
Using Helm makes it easy to modify the configuration files.
Configuration approach
Before installing, edit the values file, then render the output to review it:

```bash
/etc/kubeasz/bin/helm upgrade prometheus --install -n monitor -f prom-values.yaml /etc/kubeasz/roles/cluster-addon/files/kube-prometheus-stack-45.23.0.tgz --dry-run > prometheus.yaml
```
```yaml
## Provide a k8s version to auto dashboard import script example: kubeTargetVersionOverride: 1.16.6
```
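For orientation, a minimal prom-values.yaml sketch using standard kube-prometheus-stack value keys (every concrete value below is a placeholder, not taken from the original setup):

```yaml
# prom-values.yaml -- minimal sketch; all values are placeholders
kubeTargetVersionOverride: "1.26.0"   # set to your cluster's version
grafana:
  adminPassword: "change-me"          # placeholder credential
prometheus:
  prometheusSpec:
    retention: 15d                    # how long to keep metrics on disk
```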
You can still make changes after installation; the most important resources are the operator's CRDs:
```
alertmanagerconfigs.monitoring.coreos.com
```
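The list above is abbreviated; to enumerate all of the operator's CRDs on a live cluster:

```bash
# All Prometheus Operator CRDs live in the monitoring.coreos.com API group
kubectl get crd | grep monitoring.coreos.com
```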
Beyond those there are the various ConfigMaps and Secrets.
For example, to edit the Alertmanager configuration:
```bash
kubectl edit alertmanagerconfigs.monitoring.coreos.com -n monitor
```

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  labels:
    alertmanagerConfig: example
  name: config-example
  namespace: monitor
spec:
  receivers:
  - name: webhook
    webhookConfigs:
    - url: http://example.com/
  route:
    groupBy:
    - job
    groupInterval: 5m
    groupWait: 30s
    receiver: webhook
    repeatInterval: 12h
```
This gets merged with the original file to form the new effective configuration.
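To check the merged result, you can dump the secret the operator generates; a sketch, assuming the default naming for a release called prometheus and a recent operator version that gzips the config:

```bash
# Secret name and key are assumptions based on default kube-prometheus-stack
# naming; adjust if your Alertmanager resource is named differently
kubectl -n monitor get secret alertmanager-prometheus-kube-prometheus-alertmanager-generated \
  -o jsonpath='{.data.alertmanager\.yaml\.gz}' | base64 -d | gunzip
```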
Modifying Prometheus
Annotation-based auto-discovery (not recommended)
This project does not enable it by default. In the usual setup you add prometheus.io/scrape=true to a resource's annotations and Prometheus automatically adds it as a target. It is inconvenient to configure, so ServiceMonitor and PodMonitor are generally used instead.
Example:

```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "80"
    prometheus.io/scrape: "true"
  labels:
    app: test-frontend-250
  name: test-frontend-250-svc
  namespace: default
spec:
  ports:
  - name: test-frontend-250-http
    nodePort: 31250
    port: 80
    targetPort: 80
  selector:
    app: test-frontend-250
  type: NodePort
```
prometheus-additional.yaml, the newly added scrape config:

```yaml
- job_name: 'kubernetes-service-endpoints'
```
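The snippet above is cut off; for reference, the canonical kubernetes-service-endpoints job from the upstream Prometheus examples looks roughly like this (a sketch, not this project's exact file):

```yaml
- job_name: 'kubernetes-service-endpoints'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only endpoints whose Service is annotated prometheus.io/scrape: "true"
  - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  # Respect an explicit prometheus.io/port annotation when present
  - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
    action: replace
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
    target_label: __address__
  - source_labels: [__meta_kubernetes_namespace]
    action: replace
    target_label: kubernetes_namespace
  - source_labels: [__meta_kubernetes_service_name]
    action: replace
    target_label: kubernetes_name
```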
Create the secret:

```bash
kubectl create secret generic additional-configs --from-file=prometheus-additional.yaml -n monitor
```
Add it to the Prometheus resource:

```bash
kubectl edit prometheus -n monitor
```

```yaml
spec:
  additionalScrapeConfigs:
    name: additional-configs
    key: prometheus-additional.yaml
```
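After saving, the operator mounts the secret into the Prometheus Pod and triggers a reload. You can confirm the job landed in the running config (service name as queried later in this post):

```bash
# Port-forward the Prometheus service, then query the rendered config
kubectl -n monitor port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 &
curl -s http://localhost:9090/api/v1/status/config | grep kubernetes-service-endpoints
```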
Using ServiceMonitor, PodMonitor, and Probe
These resources all work in much the same way:
PodMonitor discovers Pods
ServiceMonitor discovers Services
Probe is generally used for custom discovery
Check the selectors currently in effect:
```bash
kubectl get prometheus -n monitor prometheus-kube-prometheus-prometheus -o yaml
```

```yaml
spec:
  podMonitorSelector:
    matchLabels:
      release: prometheus
  probeSelector:
    matchLabels:
      release: prometheus
  ruleSelector:
    matchLabels:
      release: prometheus
  serviceMonitorSelector:
    matchLabels:
      release: prometheus
```
All of these resources must carry the label release: prometheus to be matched.
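If labeling every resource is a burden, the chart can also be told to select everything; a values sketch using the chart's standard *NilUsesHelmValues switches:

```yaml
# prom-values.yaml -- select all monitors/probes/rules regardless of labels
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    probeSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
```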
Using PodMonitor
Because the postgres Pods carry a label marking their primary/replica role, it is more convenient to target the Pods directly.
PodMonitor-postgres.yaml:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: crunchy-postgres-exporter-master
  namespace: monitor
  labels:
    release: prometheus
spec:
  podMetricsEndpoints:
  - interval: 30s
    port: exporter
    path: /metrics
    relabelings:  # rewrite labels so the upstream alert rules and dashboards work unmodified
    - replacement: master
      targetLabel: role
    - replacement: postgres-operator:test
      targetLabel: pg_cluster
    - replacement: test
      targetLabel: cluster
    - sourceLabels: [pod]
      targetLabel: deployment
      regex: (.*?)-\d+
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_ip]
      targetLabel: ip
      action: replace
    - sourceLabels: [job]
      targetLabel: job
      regex: monitor/(.*)-(master|replica)
      replacement: $1
      action: replace
    - sourceLabels: [namespace]
      targetLabel: kubernetes_namespace
      action: replace
  namespaceSelector:
    matchNames:
    - postgres-operator
  selector:
    matchExpressions:
    - key: postgres-operator.crunchydata.com/role
      operator: In
      values: [master]
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: crunchy-postgres-exporter-replica
  namespace: monitor
  labels:
    release: prometheus
spec:
  podMetricsEndpoints:
  - interval: 30s
    port: exporter
    path: /metrics
    relabelings:
    - replacement: replica
      targetLabel: role
    - replacement: postgres-operator:test
      targetLabel: pg_cluster
    - replacement: test
      targetLabel: cluster
    - sourceLabels: [pod]
      targetLabel: deployment
      regex: (.*?)-\d+
      replacement: $1
      action: replace
    - sourceLabels: [__meta_kubernetes_pod_ip]
      targetLabel: ip
      action: replace
    - sourceLabels: [job]
      targetLabel: job
      regex: monitor/(.*)-(master|replica)
      replacement: $1
      action: replace
    - sourceLabels: [namespace]
      targetLabel: kubernetes_namespace
      action: replace
  namespaceSelector:
    matchNames:
    - postgres-operator
  selector:
    matchExpressions:
    - key: postgres-operator.crunchydata.com/role
      operator: In
      values: [replica]
```
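Apply the file and check that the monitors are registered; the new targets should then show up in the Prometheus UI under Status -> Targets:

```bash
kubectl apply -f PodMonitor-postgres.yaml
kubectl -n monitor get podmonitor
```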
Configuring PrometheusRule
The default file (excerpt):

```yaml
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/name: postgres-operator-monitoring
    vendor: crunchydata
  name: alertmanager-rules-config
apiVersion: v1
data:
  crunchy-alert-rules-pg.yml: |
    groups:
    - name: alert-rules
      rules:
      - alert: PGExporterScrapeError
        expr: pg_exporter_last_scrape_error > 0
        for: 60s
        labels:
          service: postgresql
          severity: critical
          severity_num: 300
        annotations:
          summary: 'Postgres Exporter running on {{ $labels.job }} (instance: {{ $labels.instance }}) is encountering scrape errors processing queries. Error count: ( {{ $value }} )'
      - alert: ExporterDown
        expr: avg_over_time(up[5m]) < 0.5
        for: 10s
        labels:
          service: system
          severity: critical
          severity_num: 300
        annotations:
          description: 'Metrics exporter service for {{ $labels.job }} running on {{ $labels.instance }} has been down at least 50% of the time for the last 5 minutes. Service may be flapping or down.'
          summary: 'Prometheus Exporter Service Down'
```
Rewrite it as a PrometheusRule, then apply it:

```bash
kubectl apply -f prometheusrule-postgres.yaml
```

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    release: prometheus  # this label is required for the rule to be matched
  name: postgres-operator.rule
  namespace: monitor
spec:
  groups:
  - name: postgres-operator-rule
    rules:
    - alert: PGExporterScrapeError
      expr: pg_exporter_last_scrape_error > 0
      for: 60s
      labels:
        service: postgresql
        severity: critical
        severity_num: 300
      annotations:
        summary: 'Postgres Exporter running on {{ $labels.job }} (instance: {{ $labels.instance }}) is encountering scrape errors processing queries. Error count: ( {{ $value }} )'
    - alert: ExporterDown
      expr: avg_over_time(up[5m]) < 0.5
      for: 10s
      labels:
        service: system
        severity: critical
        severity_num: 300
      annotations:
        description: 'Metrics exporter service for {{ $labels.job }} running on {{ $labels.instance }} has been down at least 50% of the time for the last 5 minutes. Service may be flapping or down.'
        summary: 'Prometheus Exporter Service Down'
```
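Once applied, verify the rule object exists; the operator copies matched rules into the rule files it manages, and the alerts then appear in the Prometheus UI under Alerts:

```bash
kubectl -n monitor get prometheusrule postgres-operator.rule
```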
Modifying alertmanager-config
Usually there is no need to change it.

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
```
Modifying Grafana
Grafana dashboard template files: most open-source components ship their own, or you can download them from grafana.com, or write your own.

```bash
[root@test-17 dashboards]# ls -l
```

Load all of the files above at once, or configure them one by one as needed.
In the JSON files, "datasource": "PROMETHEUS" needs to be replaced with "datasource": "Prometheus":

```bash
find . -type f -exec sed -i 's/"datasource": "PROMETHEUS"/"datasource": "Prometheus"/g' {} \;
```
The ConfigMaps must be labeled so Grafana's sidecar can discover them dynamically.
The default label is grafana_dashboard=1; dashboards can also be placed in different folders:

```bash
kubectl -n monitor label cm grafana-postgres-overview grafana_dashboard=1
```
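For completeness, a sketch of creating such a ConfigMap from a downloaded dashboard JSON (the file name here is hypothetical):

```bash
# postgres-overview.json is a placeholder for whichever dashboard you downloaded
kubectl -n monitor create configmap grafana-postgres-overview --from-file=postgres-overview.json
kubectl -n monitor label cm grafana-postgres-overview grafana_dashboard=1
```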
View the dashboard provider config:

```bash
[root@test-17 monitoring]# kubectl get cm -n monitor prometheus-grafana-config-dashboards -o yaml
apiVersion: v1
data:
  provider.yaml: |-
    apiVersion: 1
    providers:
    - name: 'sidecarProvider'
      orgId: 1
      folder: ''
      type: file
      disableDeletion: false
      allowUiUpdates: false
      updateIntervalSeconds: 30
      options:
        foldersFromFilesStructure: false
        path: /tmp/dashboards
```
Additional notes
Using ServiceMonitor
Example: monitoring Redis

```yaml
apiVersion: monitoring.coreos.com/v1
```
The redis Service:

```bash
[root@test-17 monitoring]# kubectl get svc -n redis redis-metrics --show-labels
```

Full definition:

```yaml
apiVersion: v1
```
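Since the ServiceMonitor above is truncated, here is a sketch of what it plausibly contains for the redis-metrics Service (the port name and selector labels are assumptions; verify them with --show-labels as above):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: redis
  namespace: monitor
  labels:
    release: prometheus        # required by the serviceMonitorSelector
spec:
  endpoints:
  - port: metrics              # assumed port name on the redis-metrics Service
    interval: 30s
  namespaceSelector:
    matchNames:
    - redis
  selector:
    matchLabels:
      app: redis-metrics       # assumed label; check the actual Service
```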
Using Probe

```yaml
apiVersion: monitoring.coreos.com/v1
```

With this in place probing works, but blackbox-exporter must be deployed first.
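Again the Probe above is truncated; a sketch of a typical HTTP probe through blackbox-exporter (the exporter's service address and the target URL are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: blackbox-http
  namespace: monitor
  labels:
    release: prometheus              # required by the probeSelector
spec:
  jobName: blackbox-http
  prober:
    url: blackbox-exporter:9115      # assumed blackbox-exporter Service address
  module: http_2xx                   # standard blackbox module for HTTP 2xx checks
  targets:
    staticConfig:
      static:
      - https://grafana.test.com     # example target; replace with your own
```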
Enabling OIDC for Grafana
grafana.ini:

```bash
kubectl edit cm -n monitor grafana
```

```ini
[server]
protocol = http
domain = grafana.test.com
root_url = https://grafana.test.com/

[auth.generic_oauth]
enabled = true
allow_sign_up = true
auto_login = false
client_id = app_xxxxx
client_secret = CSHPuFPgkxXUtSbGwWLyDgeSwGWtsYq4PSy2GTovoMJVA3
scopes = openid profile email
auth_url = https://kzipufcn.aliyunidaas.com/login/app/app_xxxxx/oauth2/authorize
token_url = https://eiam-api-cn-hangzhou.aliyuncs.com/v2/idaas_sqsdywwjwwvzug45qq46ylbwdm/app_xxxxx/oauth2/token
api_url = https://eiam-api-cn-hangzhou.aliyuncs.com/v2/idaas_sqsdywwjwwvzug45qq46ylbwdm/app_xxxxx/oauth2/userinfo
redirect_uri = https://grafana.test.com/login/generic_oauth
email_attribute_path = email
role_attribute_path = contains(email, 'weilai@') && 'Admin' || endsWith(email, '@admin.com') && 'Admin' || endsWith(email, '@na.com') && 'Editor' || 'Viewer'
```
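After editing the ConfigMap, restart Grafana so it rereads grafana.ini (the deployment name assumes a release called prometheus):

```bash
kubectl -n monitor rollout restart deployment prometheus-grafana
```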