Health Checks and Failure Detection for Distributed Model Serving
Core monitoring metric configuration
1. Model service availability
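# prometheus.yml (entry under scrape_configs)
- job_name: 'model-service'
  # A job takes a single metrics_path. /metrics serves the Prometheus exposition format;
  # /health is probed separately by health_check.sh.
  metrics_path: /metrics
  scrape_interval: 30s
  static_configs:
    - targets: ['model-server-0:8080', 'model-server-1:8080', 'model-server-2:8080']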
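Before relying on this scrape job, it can help to confirm that each target actually exposes the series that the expressions in this document query. The following is a minimal sketch, not part of the original setup: has_metric is a hypothetical helper that reads Prometheus text-format output on standard input and succeeds if a sample for the given metric name is present.

```shell
# has_metric NAME: read Prometheus text-format metrics on stdin and
# succeed if at least one sample line for NAME is present.
has_metric() {
    local name="$1"
    # Sample lines look like `name{labels} value` or `name value`.
    # HELP/TYPE comment lines start with '#', so the anchored pattern
    # never matches them.
    grep -Eq "^${name}(\{| )"
}
```

A possible invocation, assuming the servers answer on port 8080 as configured above: curl -fsS http://model-server-0:8080/metrics | has_metric http_request_duration_seconds_bucket || echo "latency histogram missing"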
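2. Key performance metrics
- Request latency (P95, in seconds):
  histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
- Error rate (share of responses with a 5xx status):
  sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
- CPU usage (in cores; 0.8 corresponds to 80% only when the container is limited to one core):
  rate(container_cpu_usage_seconds_total[5m]) > 0.8
- Memory usage (fraction of the configured limit):
  container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9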
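The error-rate condition is easiest to reason about as a ratio: the number of 5xx responses in a window divided by all responses in the same window, compared against 0.01 (1%). The sketch below works through that arithmetic with two hypothetical helpers, error_ratio and exceeds; the request counts in the usage note are made-up illustration values, not measurements.

```shell
# error_ratio ERRORS TOTAL: print the fraction of failed requests,
# given the counter increases observed over the same window.
error_ratio() {
    awk -v e="$1" -v t="$2" \
        'BEGIN { if (t == 0) print 0; else printf "%.4f\n", e / t }'
}

# exceeds VALUE THRESHOLD: succeed when VALUE is strictly greater
# than THRESHOLD (numeric comparison).
exceeds() {
    awk -v v="$1" -v th="$2" 'BEGIN { exit !(v > th) }'
}
```

For example, 12 failures out of 1000 requests give a ratio of 0.0120, which is above the 0.01 threshold, while 5 out of 1000 give 0.0050, which is not: exceeds "$(error_ratio 12 1000)" 0.01 && echo "error rate too high"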
Alerting rule configuration
Basic alerting rules:
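# alerting rules (entries under groups[].rules in a rule file)
- alert: ModelServiceHighLatency
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 2
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Model service latency is too high"
    description: "P95 request latency is above 2 seconds; current value is {{ $value }} seconds"
- alert: ModelServiceHighErrorRate
  # Ratio of 5xx responses to all responses, so that 0.01 means 1%
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Model service error rate is too high"
    description: "5xx error ratio is above 1%; current value is {{ $value }}"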
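The for: clause in these rules means an alert does not fire the moment its expression becomes true. It first enters a pending state and fires only once the expression has stayed true for the whole hold duration; a single false evaluation resets the timer. The following sketch simulates that behaviour. The alert_states function and the 30-second evaluation interval in the usage note are assumptions for illustration (the configuration above only specifies the scrape interval).

```shell
# alert_states INTERVAL HOLD RESULT...
#   INTERVAL: seconds between rule evaluations (assumed value)
#   HOLD:     the `for:` duration in seconds
#   RESULT:   1 if the expression was true at that evaluation, else 0
# Prints one state per evaluation: inactive, pending, or firing.
alert_states() {
    local interval="$1" hold="$2"
    shift 2
    local now=0 active_at=-1 result
    for result in "$@"; do
        if [ "$result" -eq 1 ]; then
            # Remember when the condition first became true.
            if [ "$active_at" -lt 0 ]; then active_at="$now"; fi
            if [ $((now - active_at)) -ge "$hold" ]; then
                echo firing
            else
                echo pending
            fi
        else
            # Any false evaluation resets the timer.
            active_at=-1
            echo inactive
        fi
        now=$((now + interval))
    done
}
```

With a 30-second interval and for: 3m (180 seconds), alert_states 30 180 1 1 1 1 1 1 1 prints six pending lines followed by one firing line, whereas a single 0 in the middle of the sequence restarts the count.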
Health check script:
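#!/bin/bash
# health_check.sh: probe /health on every model server; exit non-zero if any is down
status=0
for server in model-server-0 model-server-1 model-server-2; do
    # -f: fail on HTTP errors; --max-time: do not hang on an unresponsive server
    curl -fsS --max-time 5 -o /dev/null "http://$server:8080/health" || { echo "[$server] DOWN"; status=1; }
done
exit "$status"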
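A single failed probe can be a transient network blip rather than a real outage. One common way to reduce false positives is to declare a server down only after several consecutive failures. The sketch below illustrates that idea; is_down and probe_health are hypothetical helpers, and the attempt count, delay, and 5-second timeout are assumed values rather than settings from this document.

```shell
# is_down ATTEMPTS DELAY PROBE [ARGS...]
# Runs PROBE up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Succeeds (server considered down) only if every attempt fails.
is_down() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 1    # one successful probe is enough: not down
        fi
        # Wait before retrying, but not after the last attempt.
        if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
        i=$((i + 1))
    done
    return 0
}

# Hypothetical probe against the /health endpoint used by the script above.
probe_health() {
    curl -fsS --max-time 5 -o /dev/null "http://$1:8080/health"
}
```

A possible use inside the loop of the health check script: is_down 3 2 probe_health "$server" && echo "[$server] DOWN"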
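Implementation steps
- Deploy the Prometheus monitoring service
- Configure the model service to expose a metrics port
- Create the alerting rules and integrate them with Alertmanager
- Schedule the health check script as a periodic job
- Verify the monitoring dashboards and alert delivery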

Discussion