Monitoring a RabbitMQ Cluster with Prometheus
System environment
- Operating system: CentOS 7.6
- Docker version: 19.03.5
- Prometheus version: 2.36.0
- Kubernetes version: 1.20.0
- RabbitMQ version: 3.8.27
Monitoring methods
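RabbitMQ exposes metrics itself through a Prometheus plugin
- Before 3.8.0, RabbitMQ can use the separate plugin prometheus_rabbitmq_exporter to expose metrics to Prometheus. It has to be downloaded into the RabbitMQ installation directory and installed separately; prometheus_rabbitmq_exporter: https://github.com/deadtrickster/prometheus_rabbitmq_exporter
- From 3.8.0 onward, RabbitMQ ships with built-in Prometheus & Grafana support. The plugin is bundled, but it still has to be enabled; rabbitmq-prometheus: https://github.com/rabbitmq/rabbitmq-prometheus
A standalone program collects the metrics (rabbitmq_exporter)
This works with any RabbitMQ version, but the exporter has to run as a separate process; rabbitmq_exporter: https://github.com/kbudde/rabbitmq_exporter
The RabbitMQ cluster in this article is deployed inside a k8s cluster and runs version 3.8.27; see the blog post "k8s部署Rabbitmq集群" (deploying a RabbitMQ cluster on k8s) for the setup. If your RabbitMQ cluster runs outside the k8s cluster, proxy it into the k8s cluster through Endpoints first and then follow the steps below. This article covers two ways of collecting the metrics: rabbitmq-exporter and the built-in plugin.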
Monitoring with rabbitmq-exporter
Deploy rabbitmq-exporter
[root@k8s01 rabbitmq]# vim rabbitmq-exporter.yaml
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-exporter
  namespace: monitoring
  labels:
    k8s-app: rabbitmq-exporter
spec:
  type: ClusterIP
  clusterIP: None
  ports:
  - name: api
    port: 9099
    protocol: TCP
  selector:
    k8s-app: rabbitmq-exporter
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rabbitmq-exporter
  namespace: monitoring
  labels:
    k8s-app: rabbitmq-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: rabbitmq-exporter
  template:
    metadata:
      labels:
        k8s-app: rabbitmq-exporter
    spec:
      containers:
      - name: rabbitmq-exporter
        image: kbudde/rabbitmq-exporter:v0.29.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: http
          containerPort: 9099
        env:
        # Port on which the exporter serves /metrics
        - name: PUBLISH_PORT
          value: "9099"
        - name: RABBIT_CAPABILITIES
          value: "bert,no_sort"
        - name: RABBIT_USER
          value: "super"
        - name: RABBIT_PASSWORD
          value: "super"
        # RabbitMQ management API address (port 15672)
        - name: RABBIT_URL
          value: http://rabbitmq-cluster.tools-env:15672
[root@k8s01 rabbitmq]# kubectl apply -f rabbitmq-exporter.yaml
service/rabbitmq-exporter created
deployment.apps/rabbitmq-exporter created
Once the deployment is complete, run the following commands to check that metrics are being collected
[root@k8s01 rabbitmq]# kubectl get -n monitoring po -owide|grep rabbitmq-exporter
rabbitmq-exporter-549b5fddfc-qnjj6 1/1 Running 0 11h 10.244.1.233 k8s05.xxx.local <none> <none>
[root@k8s01 rabbitmq]# curl http://10.244.1.233:9099/metrics
...
# HELP rabbitmq_version_info A metric with a constant '1' value labeled by rabbitmq version, erlang version, node, cluster.
# TYPE rabbitmq_version_info gauge
...
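The check above can also be scripted instead of read by eye. The following is a minimal sketch, assuming a POSIX shell with grep; has_metric is a hypothetical helper name introduced here for illustration, not something shipped with the exporter. It reads Prometheus exposition text on stdin and succeeds when the named metric is declared.

```shell
#!/bin/sh
# has_metric NAME
# Hypothetical helper: reads Prometheus exposition text on stdin and
# returns 0 if a "# TYPE NAME ..." line is present, non-zero otherwise.
has_metric() {
    grep -q "^# TYPE $1 "
}

# Example use against the exporter pod shown above (needs network access):
#   curl -s http://10.244.1.233:9099/metrics | has_metric rabbitmq_version_info \
#       && echo "exporter is serving rabbitmq_version_info"
```

Checking the "# TYPE" line rather than a sample line keeps the match independent of whatever labels the metric carries.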
Configure Prometheus to monitor RabbitMQ
Prometheus was deployed on k8s earlier, and its configuration file is stored in a Kubernetes ConfigMap. To scrape RabbitMQ, edit that ConfigMap and add the RabbitMQ job to the configuration
[root@k8s01 prometheus]# cat prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: "kubernetes"
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:9093"]
    rule_files:
    - /etc/prometheus/*-rule.yml
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets: ['127.0.0.1:9090']
        labels:
          instance: prometheus
    # Add the following job
    - job_name: 'rabbitmq_nodes'
      static_configs:
      - targets: ['rabbitmq-exporter:9099']  # address of rabbitmq-exporter
        labels:
          instance: rabbitmq-cluster
    ...
[root@k8s01 prometheus]# kubectl apply -f prometheus-config.yaml
configmap/prometheus-config configured
Reload the Prometheus configuration
[root@k8s01 prometheus]# curl -XPOST http://10.x.x.x:30089/-/reload
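Check the Prometheus UI
Import the Grafana dashboard
Log in to Grafana, open the menu in the left sidebar, and choose Manage. Click the Import button in the upper right, enter 2121 as the dashboard ID to pull in the rabbitmq-exporter dashboard, and click Load. On the next page select the Prometheus data source, then click Import to open the dashboard.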
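Monitoring with RabbitMQ's built-in Prometheus plugin
The RabbitMQ cluster is already deployed, so the deployment is not repeated here. If your RabbitMQ cluster runs outside the k8s cluster, proxy it into the k8s cluster through Endpoints first
Enable the rabbitmq_prometheus plugin
Skip this step if the plugin is already enabled; otherwise enable it with the following command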
rabbitmq-plugins enable rabbitmq_prometheus
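Verify the metrics
Once the rabbitmq_prometheus plugin is enabled, use curl to check that it outputs the metrics Prometheus needs. The plugin serves them on port 15692 by default; localhost below assumes the command runs where a RabbitMQ node's port 15692 is reachable locally, for example inside a RabbitMQ pod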
[root@k8s01 prometheus]# curl -s localhost:15692/metrics | head -n 6
# TYPE erlang_mnesia_held_locks gauge
# HELP erlang_mnesia_held_locks Number of held locks.
erlang_mnesia_held_locks 0
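To pull a single value out of that output, a small filter is enough. This is a sketch under stated assumptions: metric_value is a hypothetical helper name, it assumes awk is available, and it only matches samples written without labels, such as erlang_mnesia_held_locks above.

```shell
#!/bin/sh
# metric_value NAME
# Hypothetical helper: reads Prometheus exposition text on stdin and prints
# the value of every unlabelled sample whose metric name is exactly NAME.
# "# HELP" and "# TYPE" lines are skipped because their first field is "#".
metric_value() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Example use (needs access to a RabbitMQ node):
#   curl -s localhost:15692/metrics | metric_value erlang_mnesia_held_locks
```

Samples that carry labels, written as name{label="..."} value, have a different first field and are deliberately not matched by this filter.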
Configure Prometheus to monitor RabbitMQ
Prometheus was deployed on k8s earlier, and its configuration file is stored in a Kubernetes ConfigMap. To scrape RabbitMQ, edit that ConfigMap and add the RabbitMQ job to the configuration
[root@k8s01 prometheus]# cat prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: "kubernetes"
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager:9093"]
    rule_files:
    - /etc/prometheus/*-rule.yml
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets: ['127.0.0.1:9090']
        labels:
          instance: prometheus
    - job_name: 'rabbitmq_cluster'
      static_configs:
      # Adjust the targets to match where your nodes are actually deployed
      - targets:
        - 'rabbitmq-0.rabbitmq.tools-env.svc.cluster.local:15692'
        - 'rabbitmq-1.rabbitmq.tools-env.svc.cluster.local:15692'
        - 'rabbitmq-2.rabbitmq.tools-env.svc.cluster.local:15692'
        - 'rabbitmq-3.rabbitmq.tools-env.svc.cluster.local:15692'
        - 'rabbitmq-4.rabbitmq.tools-env.svc.cluster.local:15692'
        labels:
          instance: rabbitmq-cluster
    ...
[root@k8s01 prometheus]# kubectl apply -f prometheus-config.yaml
configmap/prometheus-config configured
Reload the Prometheus configuration
[root@k8s01 prometheus]# curl -XPOST http://10.x.x.x:30089/-/reload
Check the Prometheus UI
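The static target list in the rabbitmq_cluster job is easy to mistype as the cluster grows. As a rough sketch, it can be generated instead. This assumes the pod and headless-service naming implied by the DNS names in the config above (pods rabbitmq-0 to rabbitmq-N behind a service named rabbitmq in the tools-env namespace); rabbitmq_targets is a hypothetical helper name.

```shell
#!/bin/sh
# rabbitmq_targets COUNT
# Hypothetical helper: prints one "host:port" scrape target per line for
# pods rabbitmq-0 .. rabbitmq-(COUNT-1). The service name, namespace and
# port 15692 are taken from the scrape config in this article; change
# them if your deployment differs.
rabbitmq_targets() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'rabbitmq-%d.rabbitmq.tools-env.svc.cluster.local:15692\n' "$i"
        i=$((i + 1))
    done
}

# Print the five targets used in this article
rabbitmq_targets 5
```

Each printed line can be pasted into the targets list as one quoted entry.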
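Import the Grafana dashboard
Log in to Grafana, open the menu in the left sidebar, and choose Manage. Click the Import button in the upper right, enter 10991 as the dashboard ID to pull in the rabbitmq_prometheus dashboard, and click Load. On the next page select the Prometheus data source, then click Import to open the dashboard.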
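RabbitMQ alerting configuration
The steps above set up monitoring for the RabbitMQ cluster, but nobody watches dashboards all day, so alerting rules are needed to flag problems in the cluster. Prometheus is already deployed and its configuration is managed through a ConfigMap, so it is enough to edit the ConfigMap and add the alerting rules
Note: the alerting rules in this article are written against rabbitmq_prometheus metrics and do not apply to rabbitmq-exporter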
[root@k8s01 prometheus]# vim prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  ...
  rabbitmq-rule.yml: |
    groups:
    - name: RabbitMQ node down
      rules:
      - alert: RabbitmqNodeDown
        expr: sum(rabbitmq_build_info) < 3
        for: 0m
        labels:
          severity: error
        annotations:
          summary: "Rabbitmq node down (instance {{ $labels.instance }})"
          description: "Fewer than 3 nodes are running in the RabbitMQ cluster\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - name: RabbitMQ node not distributed
      rules:
      - alert: RabbitmqNodeNotDistributed
        expr: erlang_vm_dist_node_state < 3
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Rabbitmq node not distributed (instance {{ $labels.instance }})"
          description: "The RabbitMQ cluster distribution link is not in the up state\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - name: RabbitMQ memory above 90%
      rules:
      - alert: RabbitmqMemoryHigh
        expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Rabbitmq memory high (instance {{ $labels.instance }})"
          description: "A RabbitMQ cluster node is using more than 90% of its allocated RAM\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - name: RabbitMQ too many unacknowledged messages
      rules:
      - alert: RabbitmqTooManyUnackMessages
        expr: sum(rabbitmq_queue_messages_unacked) BY (queue) > 1000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Rabbitmq too many unack messages (instance {{ $labels.instance }})"
          description: "The RabbitMQ cluster has more than 1000 unacknowledged messages\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - name: RabbitMQ too many connections on a node
      rules:
      - alert: RabbitmqTooManyConnections
        expr: rabbitmq_connections > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Rabbitmq too many connections (instance {{ $labels.instance }})"
          description: "A RabbitMQ cluster node has more than 1000 connections in total\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
    - name: RabbitMQ queue without consumers
      rules:
      - alert: RabbitmqNoQueueConsumer
        expr: rabbitmq_queue_consumers < 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Rabbitmq no queue consumer (instance {{ $labels.instance }})"
          description: "A RabbitMQ cluster queue has fewer than 1 consumer\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
[root@k8s01 prometheus]# kubectl apply -f prometheus-config.yaml
configmap/prometheus-config configured
Reload the Prometheus configuration
[root@k8s01 prometheus]# curl -XPOST http://10.105.x.x:30089/-/reload
Check the Prometheus UI
Open the Prometheus UI and click Alerts to confirm that the rules have taken effect
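To get a feel for when the RabbitmqMemoryHigh rule would fire, the same arithmetic can be worked by hand. The sketch below is illustrative only: memory_pct and memory_high are hypothetical helper names, the byte values in the example are made up, and the division is written as used * 100 / limit, which is mathematically the same as the rule's used / limit * 100.

```shell
#!/bin/sh
# memory_pct USED LIMIT
# Hypothetical helper: prints USED as a percentage of LIMIT, one decimal place.
memory_pct() {
    awk -v used="$1" -v limit="$2" 'BEGIN { printf "%.1f\n", used * 100 / limit }'
}

# memory_high USED LIMIT
# Hypothetical helper: returns 0 when the percentage is strictly above 90,
# mirroring the "> 90" comparison in the alert expression.
memory_high() {
    awk -v used="$1" -v limit="$2" 'BEGIN { exit !(used * 100 / limit > 90) }'
}

# Made-up example: 950 bytes resident against a 1000 byte limit
memory_pct 950 1000
memory_high 950 1000 && echo "RabbitmqMemoryHigh condition met"
```

The real rule additionally requires the condition to hold for 2m before the alert fires, which this one-off calculation does not model.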