Microservice Dependency Monitoring: Lessons Learned
Background
While building an ML model monitoring platform, we found that abnormal inter-service dependencies were causing model inference latency to spike. Using a Prometheus + Grafana monitoring stack, we eventually traced the root cause.
Core Monitoring Configuration
# prometheus.yml
scrape_configs:
  - job_name: 'model-service'
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:8080']
    metric_relabel_configs:
      # Tag request-latency samples with the service name
      - source_labels: [__name__]
        regex: 'http_request_duration_seconds_bucket'
        target_label: service
        replacement: model-inference
      # Intended to tag the up metric with the dependency name.
      # Caveat: up is synthesized by Prometheus after the scrape, so
      # metric_relabel_configs never see it — this rule has no effect on up.
      - source_labels: [__name__]
        regex: 'up'
        target_label: dependency
        replacement: data-service
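The DependencyDown alert in the next section queries up{job="data-service"}, but the scrape config above only defines the model-service job, so that series would never exist. A sketch of the missing job (the target address is an illustrative assumption):

```yaml
# Assumed scrape job for data-service; without it, up{job="data-service"}
# has no samples and the DependencyDown alert can never fire.
scrape_configs:
  - job_name: 'data-service'
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:8081']  # hypothetical address
```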
Alert Rule Configuration
# alert.rules.yaml
groups:
  - name: service-dependency
    rules:
      - alert: HighDependencyLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="model-service"}[5m])) by (le)) > 2
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Model service dependency latency too high"
          description: "95th-percentile request latency of model-service has exceeded 2 s for 3 minutes, typically caused by a slow data-service dependency"
      - alert: DependencyDown
        expr: up{job="data-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Dependency service down"
          description: "data-service is unreachable; model inference will fail"
Pitfalls Encountered
At first we only monitored HTTP request counts, so inter-service call timeouts produced no alerts at all. Only after adding the http_request_duration_seconds_bucket histogram and alerting on its 95th percentile did the problem surface. In production, monitor at least three dimensions: latency, success rate, and error rate.
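The latency dimension is covered by the HighDependencyLatency rule above. A sketch of an error-rate rule, assuming the service exports a standard http_requests_total counter with a code label (an assumption about the instrumentation, not something shown in the configs above):

```yaml
# Hypothetical addition to alert.rules.yaml — metric and label names assumed
groups:
  - name: service-quality
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="model-service", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="model-service"}[5m])) > 0.05
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Model service 5xx error rate above 5%"
```

Success rate is the complement of this ratio, so the same counters cover both dimensions.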
Reproduction Steps
- Deploy model-service and data-service
- Add the configuration above to Prometheus
- Observe the service dependency graph in the Grafana dashboard
- Simulate a data-service timeout and verify that the alert fires
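For the last step, a minimal sketch of a mock slow data-service that responds above the 2 s alert threshold, so the HighDependencyLatency rule can be exercised end to end. The port and delay are illustrative assumptions, not values from the setup above:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

RESPONSE_DELAY_SECONDS = 3  # deliberately above the 2 s alert threshold


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Simulate a slow dependency: sleep past the threshold, then reply 200.
        time.sleep(RESPONSE_DELAY_SECONDS)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def serve(port=18080):
    """Start the mock data-service in a background thread; returns the server."""
    server = HTTPServer(("127.0.0.1", port), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Pointing model-service at this endpoint (instead of the real data-service) should push its p95 request latency past 2 s and fire the alert after the 3-minute hold.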

Discussion