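Kafka Cluster Status Monitoring Configuration
Core Monitoring Metrics
1. Cluster health
- kafka.server:type=KafkaServer,name=BrokerState: broker lifecycle state (0 = not running, 1 = starting, 2 = recovering from unclean shutdown, 3 = running, 6 = pending controlled shutdown, 7 = shutting down)
- kafka.controller:type=KafkaController,name=ActiveControllerCount: 1 on the broker that currently acts as controller and 0 on the others, so the sum across the cluster should be exactly 1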
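For these broker MBeans to appear in Prometheus under names such as kafka_server_KafkaServer_BrokerState, the JMX Exporter needs mapping rules. The following is a minimal sketch, assuming the exporter runs as a Java agent on each broker and that both gauges expose their reading through the Value attribute. The controller gauge is matched under kafka.controller:type=KafkaController, which is the MBean name Kafka registers for it. The file name and the output metric names are assumptions, chosen to line up with the names used in the alert rules in this document.

```yaml
# Hypothetical file: kafka-broker-jmx.yml, passed to the JMX Exporter Java agent
rules:
  # Broker lifecycle state gauge
  - pattern: 'kafka.server<type=KafkaServer, name=BrokerState><>Value'
    name: kafka_server_KafkaServer_BrokerState   # assumed output name
    type: GAUGE
    help: "Kafka broker lifecycle state"
  # Active controller gauge (1 on the controller, 0 elsewhere)
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_KafkaController_ActiveControllerCount   # assumed output name
    type: GAUGE
    help: "Whether this broker is the active controller"
```

Explicit name fields are used here instead of a catch-all pattern so that the exported names do not depend on the exporter's default naming or lowercasing options.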
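2. Consumer group monitoring (client-side metrics, exposed by the consumer JVM rather than the broker)
- kafka.consumer:type=consumer-fetch-manager-metrics, attribute records-lag: consumer lag in records, reported per partition (records-lag-max gives the per-client maximum)
- kafka.consumer:type=consumer-coordinator-metrics, attributes join-time-avg and join-time-max: time taken to join the group, in milliseconds
3. Producer monitoring (client-side metrics, exposed by the producer JVM)
- kafka.producer:type=producer-metrics, attribute record-send-rate: records sent per second
- kafka.producer:type=producer-metrics, attribute record-error-rate: records per second whose send failed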
Alerting Configuration
# Prometheus alerting rules
groups:
  - name: kafka   # group name is arbitrary
    rules:
      - alert: KafkaBrokerDown
        # BrokerState 3 means running; any other value is treated as abnormal
        expr: kafka_server_KafkaServer_BrokerState{job="kafka"} != 3
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker down"
          description: "Broker {{ $labels.instance }} is in an abnormal state; current value: {{ $value }}"
      - alert: HighConsumerLag
        expr: kafka_consumer_consumer_fetch_manager_metrics_records_lag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag too high"
          description: "Consumer lag has exceeded 10000 records; current lag: {{ $value }}"
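The health metrics also call for exactly one active controller, which the rules above do not check. A rule along the following lines could cover it. This is a sketch only: the metric name kafka_controller_KafkaController_ActiveControllerCount and the group name are assumptions that follow the naming pattern of the other rules, and the real name depends on how the JMX Exporter is configured.

```yaml
groups:
  - name: kafka-controller   # hypothetical group name
    rules:
      - alert: KafkaActiveControllerCountNotOne
        # Each broker reports 0 or 1, so the cluster-wide sum should be exactly 1
        expr: sum(kafka_controller_KafkaController_ActiveControllerCount{job="kafka"}) != 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kafka cluster does not have exactly one active controller"
          description: "Sum of ActiveControllerCount across brokers is {{ $value }}, expected 1"
```

Note that sum() over no series returns nothing, so this rule stays silent if the metric is not being scraped at all. A separate check on scrape health would be needed for that case.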
Dashboard Configuration
{
  "dashboard": {
    "title": "Kafka Cluster Monitoring",
    "panels": [
      {
        "title": "Broker State",
        "targets": [{ "expr": "kafka_server_KafkaServer_BrokerState" }]
      },
      {
        "title": "Consumer Lag",
        "targets": [{ "expr": "kafka_consumer_consumer_fetch_manager_metrics_records_lag" }]
      }
    ]
  }
}
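One way to deploy a dashboard like this without going through the UI is Grafana's file-based provisioning. The sketch below assumes a provider file and a dashboard directory at the paths shown; both paths and the provider name are hypothetical. My understanding is that provisioned files hold the dashboard object itself, so the outer "dashboard" wrapper, which is the shape used when posting to the HTTP API, would be dropped from the JSON file.

```yaml
# Hypothetical path: /etc/grafana/provisioning/dashboards/kafka.yml
apiVersion: 1
providers:
  - name: kafka-dashboards          # hypothetical provider name
    folder: Kafka                   # Grafana folder that will hold the dashboards
    type: file
    options:
      path: /var/lib/grafana/dashboards   # assumed directory containing the dashboard JSON
```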
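Configuration steps:
- Attach the Prometheus JMX Exporter to each Kafka broker JVM and add it as a scrape target in Prometheus
- Load the alerting rules file above into Prometheus
- Deploy the Grafana dashboard
- Set up the Slack/Webhook notification channels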
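The Prometheus and Alertmanager side of these steps could be wired together roughly as follows. This is a sketch under stated assumptions: the host names, exporter port, file paths, Slack channel, and webhook URL are all placeholders, and the job name is set to kafka only so that it matches the job="kafka" selector in the alert expressions.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets:
          - broker-1.example.internal:7071   # hypothetical host; port is whatever the exporter listens on
rule_files:
  - /etc/prometheus/rules/kafka_alerts.yml   # hypothetical path to the alerting rules file
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager.example.internal:9093   # hypothetical Alertmanager address
---
# alertmanager.yml (fragment)
route:
  receiver: kafka-slack
receivers:
  - name: kafka-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/YOURS   # placeholder webhook URL
        channel: '#kafka-alerts'                                       # hypothetical channel
        send_resolved: true
```

The two fragments belong in separate files. They are shown together here only to make the path from scrape target to notification visible in one place.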

Discussion