AlertManager receives the alerts that Prometheus sends and forwards them through a wide range of notification channels, including email, WeChat, DingTalk, and Slack.
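Configure the DingTalk alert webhook
- Create a DingTalk group to receive the alerts
- Add a robot through the group's "智能群助手" (smart group assistant)
- Create a custom service of the type "通过Webhook接入" (connect via Webhook)
- Enter the server's IP address in the robot's settings
- Save the Webhook address; the access_token parameter in that URL is the token used in the steps below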
Configure the DingTalk plugin
Check the DingTalk plugin's release page for a newer version, then download and start the plugin:
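    cd /opt/prometheus
    wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
    tar zxvf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
    cd prometheus-webhook-dingtalk-1.4.0.linux-amd64
    # Start the plugin once you have the Webhook token.
    # Note: --ding.profile is the flag syntax of the 0.x releases; 1.x releases appear to
    # read their targets from a config file (--config.file) instead, so run
    # ./prometheus-webhook-dingtalk --help to confirm what your version accepts.
    ./prometheus-webhook-dingtalk --ding.profile="dingding=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx"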
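The profile name given to the plugin determines the URL that AlertManager later posts to. The sketch below illustrates that relationship; the helper names are hypothetical, and port 8060 and the /dingtalk/<profile>/send path are taken from the URLs in the AlertManager configuration in the next section, not from the plugin's source.

```python
from urllib.parse import parse_qs, urlparse


def parse_profile(flag_value: str) -> tuple:
    """Split a 'name=url' profile value into (profile name, DingTalk robot URL)."""
    # Split at the first "=" only; the robot URL itself contains another "=".
    name, sep, url = flag_value.partition("=")
    if not sep or not name or not url:
        raise ValueError("expected the form name=url")
    return name, url


def access_token(robot_url: str) -> str:
    """Return the access_token query parameter of a DingTalk robot URL."""
    tokens = parse_qs(urlparse(robot_url).query).get("access_token", [])
    if not tokens:
        raise ValueError("robot URL carries no access_token")
    return tokens[0]


def send_url(host: str, profile: str, port: int = 8060) -> str:
    """Build the URL that AlertManager posts to for one plugin profile (assumed layout)."""
    return f"http://{host}:{port}/dingtalk/{profile}/send"
```

Under these assumptions, a profile named dingding on host 10.10.200.201 maps to http://10.10.200.201:8060/dingtalk/dingding/send, so every profile referenced by AlertManager needs a matching profile in the plugin.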
Configure AlertManager
Edit the AlertManager configuration file alertmanager.yml and change or add the following:
    ...
    route:
      group_by: [alertname]
      receiver: dingding            # default alert receiver
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 1h
      routes:
      - receiver: 'dingding'        # alert receiver
        group_wait: 30s
        match_re:
          alertname: 内存使用率|CPU使用率|磁盘使用率
      - receiver: 'dingding_high'   # alert receiver
        group_wait: 30s
        match_re:
          alertname: 主机失去联系
    ...
    receivers:
    - name: 'dingding'
      webhook_configs:
      - url: 'http://10.10.200.201:8060/dingtalk/dingding/send'
        send_resolved: true
    - name: dingding_high
      webhook_configs:
      # The plugin must also define a profile named dingding_high for this URL to work
      - url: http://10.10.200.201:8060/dingtalk/dingding_high/send
        send_resolved: true
    # Inhibition: while an alert matching source_match is firing, alerts matching
    # target_match are muted, provided both alerts carry the same values for every
    # label listed in equal
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      equal: ['alertname', 'dev', 'instance']
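To make the routing and inhibition behaviour of this configuration concrete, here is a minimal Python sketch of the two decisions. It is a simplified model, not AlertManager's implementation: the function names are hypothetical, it assumes the first matching child route wins, that match_re patterns are anchored to the whole label value, and that a label missing from an alert counts as an empty value.

```python
import re

# Child routes from the configuration: (receiver, {label name: regex}).
ROUTES = [
    ("dingding", {"alertname": "内存使用率|CPU使用率|磁盘使用率"}),
    ("dingding_high", {"alertname": "主机失去联系"}),
]
DEFAULT_RECEIVER = "dingding"


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first child route whose match_re matches."""
    for receiver, matchers in ROUTES:
        # fullmatch models the anchored regex: the whole label value must match.
        if all(re.fullmatch(rx, labels.get(name, "")) for name, rx in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER  # no child route matched, fall back to the top-level receiver


def is_inhibited(target: dict, firing: list, equal=("alertname", "dev", "instance")) -> bool:
    """Model the inhibit rule: a firing critical alert mutes a matching warning alert."""
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k, "") == target.get(k, "") for k in equal)
        for src in firing
    )
```

In this model a warning for 磁盘使用率 on node1 is muted only while a critical alert with the same alertname and instance is firing; a critical alert on a different instance leaves it untouched.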
Configure the alert rules
Edit rules/memory_usage.yml:
    groups:
    - name: 内存告警规则        # memory alert rules
      rules:
      - alert: 内存使用率       # memory usage; this name is matched by match_re in alertmanager.yml
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 75
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Memory usage is above 75% (current: {{ $value }}%)"
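The arithmetic behind the memory expression can be checked by hand. The following sketch mirrors the formula with made-up byte counts; the function name and the sample values are illustrative assumptions only.

```python
def memory_usage_percent(total: float, free: float, buffers: float, cached: float) -> float:
    """Mirror the rule expression: (total - (free + buffers + cached)) / total * 100."""
    return (total - (free + buffers + cached)) / total * 100


GIB = 1024 ** 3
# Hypothetical host: 16 GiB total, 1 GiB free, 0.5 GiB buffers, 1.5 GiB cached.
usage = memory_usage_percent(
    total=16 * GIB, free=1 * GIB, buffers=GIB // 2, cached=3 * GIB // 2
)
print(f"memory usage: {usage}%, condition (> 75) met: {usage > 75}")
```

Because the comparison is strict, a host sitting at exactly 75% does not satisfy the condition.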
Edit rules/cpu_usage.yml:
    groups:
    - name: CPU告警规则         # CPU alert rules
      rules:
      - alert: CPU使用率        # CPU usage; this name is matched by match_re in alertmanager.yml
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Server CPU usage is above 80% (current: {{ $value }}%)"
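The CPU expression derives usage from the idle counter: the per-second increase of idle time is the idle fraction of each core, and the rest is treated as busy. Below is a simplified sketch of that calculation for one instance; the function names and sample values are hypothetical, and unlike the real irate() it ignores counter resets.

```python
def irate(prev: tuple, last: tuple) -> float:
    """Per-second rate from the last two (timestamp, counter value) samples."""
    (t0, v0), (t1, v1) = prev, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_per_core: list) -> float:
    """Mirror the rule: 100 - avg(irate(idle seconds)) * 100 across the cores of one instance."""
    rates = [irate(prev, last) for prev, last in idle_samples_per_core]
    return 100 - sum(rates) / len(rates) * 100


# Two cores sampled 16 seconds apart (made-up counter values, in idle seconds).
samples = [
    ((0, 1000.0), (16, 1004.0)),  # core 0 was idle for 4 s of 16 s -> 25% idle
    ((0, 2000.0), (16, 2002.0)),  # core 1 was idle for 2 s of 16 s -> 12.5% idle
]
usage = cpu_usage_percent(samples)
print(f"CPU usage: {usage}%, condition (> 80) met: {usage > 80}")
```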
Edit rules/disk_usage.yml:
    groups:
    - name: 磁盘告警规则        # disk alert rules
      rules:
      - alert: 磁盘使用率       # disk usage; this name is matched by match_re in alertmanager.yml
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 90
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Disk partition {{ $labels.mountpoint }} usage is above 90% (current: {{ $value }}%)"
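Unlike the memory rule, the disk expression yields one result per filesystem, which is why the description can refer to the mountpoint label. A rough sketch of that per-series evaluation follows; the function name, the input shape, and the sample sizes are assumptions made for illustration.

```python
def disk_alerts(filesystems: dict, threshold: float = 90.0) -> list:
    """Return one description per mountpoint whose usage exceeds the threshold.

    filesystems maps mountpoint -> (size_bytes, avail_bytes).
    """
    messages = []
    for mountpoint, (size, avail) in filesystems.items():
        used_percent = (size - avail) / size * 100
        if used_percent > threshold:
            messages.append(
                f"Disk partition {mountpoint} usage is above {threshold:g}% "
                f"(current: {used_percent:g}%)"
            )
    return messages


GIB = 1024 ** 3
# Hypothetical host with two filesystems; only /data is above the threshold.
alerts = disk_alerts({"/": (64 * GIB, 32 * GIB), "/data": (256 * GIB, 16 * GIB)})
print(alerts)
```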
Edit rules/node_disconnected.yml:
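    groups:
    - name: 主机告警规则        # host alert rules
      rules:
      - alert: 主机失去联系     # host unreachable; this name is matched by match_re in alertmanager.yml
        expr: up == 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "Host {{ $labels.instance }} has been unreachable for more than 1 minute"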
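Every rule above uses for: 1m, which means the condition must hold continuously for one minute before the alert fires; until then it is only pending. The sketch below models that state machine for the up == 0 rule. The function name is hypothetical and the 15-second evaluation interval is an assumption, since the article does not state one.

```python
def alert_states(up_samples: list, interval_s: int = 15, for_s: int = 60) -> list:
    """Return the alert state after each rule evaluation of `up == 0`.

    up_samples holds the value of `up` (1 or 0) seen at each evaluation.
    """
    states = []
    active_since = None  # evaluation time at which the condition first held
    for i, up in enumerate(up_samples):
        now = i * interval_s
        if up != 0:
            # The condition no longer holds, so the hold timer is reset.
            active_since = None
            states.append("inactive")
        else:
            if active_since is None:
                active_since = now
            # The alert fires only once the condition has held for the full duration.
            states.append("firing" if now - active_since >= for_s else "pending")
    return states
```

In this model a host that recovers before the minute is up never reaches the firing state, so no notification is sent for short blips.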
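Test the alert
    # Run the following command to test sending a message through the robot
    curl -H "Content-Type: application/json" \
         -d '{"msgtype":"text","text":{"content":"prometheus alert test"}}' \
         'https://oapi.dingtalk.com/robot/send?access_token=xxxxxx'
Because of DingTalk's security restrictions, every message sent must contain the word prometheus; otherwise prometheus-webhook-dingtalk reports a 422 error.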
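The same test can be scripted so that the keyword requirement is checked before anything is sent. This is a minimal sketch using only the Python standard library; the function names are hypothetical, the payload shape is copied from the curl command above, and the keyword check simply encodes the restriction described in this article.

```python
import json
import urllib.request

KEYWORD = "prometheus"  # the word this article says every message must contain


def build_text_message(content: str) -> bytes:
    """Build the JSON body for a DingTalk text message, enforcing the keyword."""
    if KEYWORD not in content:
        raise ValueError(f"message must contain {KEYWORD!r}")
    payload = {"msgtype": "text", "text": {"content": content}}
    return json.dumps(payload).encode("utf-8")


def send_text_message(webhook_url: str, content: str) -> str:
    """POST the message to the robot Webhook and return the raw response body."""
    request = urllib.request.Request(
        webhook_url,
        data=build_text_message(content),  # supplying data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling send_text_message with the saved Webhook address and the text "prometheus alert test" should behave like the curl command; a message without the keyword is rejected locally before any request is made.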