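A Multi-Dimensional Monitoring System for Model Service Availability Metrics
Core Monitoring Metrics
Response Time Monitoring
- Alert when P95 response time > 200 ms
- Alert when P99 response time > 500 ms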
# Prometheus monitoring configuration (bucket boundaries in milliseconds)
metrics:
  - name: model_response_time_ms
    type: histogram
    buckets: [50, 100, 200, 500, 1000]
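To make the link between these bucket boundaries and the P95/P99 thresholds concrete, here is a minimal sketch of how a percentile can be estimated from cumulative bucket counts by linear interpolation inside the matching bucket, similar in spirit to PromQL's histogram_quantile. The function name estimate_quantile and the sample counts are hypothetical and serve only as an illustration.

```python
from bisect import bisect_left


def estimate_quantile(q, upper_bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative histogram bucket counts.

    upper_bounds: sorted finite bucket upper bounds, e.g. [50, 100, 200, 500, 1000]
    cumulative_counts: observations <= each bound, followed by one final
        entry for the implicit +Inf bucket (the total observation count).
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("need one count per bound plus the +Inf bucket")
    total = cumulative_counts[-1]
    if total == 0:
        return float("nan")
    rank = q * total
    # Index of the first bucket whose cumulative count reaches the rank
    idx = bisect_left(cumulative_counts, rank)
    if idx >= len(upper_bounds):
        # The quantile falls in the +Inf bucket: report the highest finite bound
        return float(upper_bounds[-1])
    lower = upper_bounds[idx - 1] if idx > 0 else 0.0
    upper = upper_bounds[idx]
    count_below = cumulative_counts[idx - 1] if idx > 0 else 0
    in_bucket = cumulative_counts[idx] - count_below
    if in_bucket == 0:
        return float(upper)
    # Assume observations are spread evenly inside the bucket
    return lower + (upper - lower) * (rank - count_below) / in_bucket


# Hypothetical sample: 1000 requests spread over the configured buckets
bounds = [50, 100, 200, 500, 1000]
counts = [400, 700, 900, 980, 998, 1000]
p95 = estimate_quantile(0.95, bounds, counts)
p99 = estimate_quantile(0.99, bounds, counts)
```

With these sample counts the P95 estimate lands in the 200 to 500 ms bucket, so the 200 ms threshold would be breached. Because the estimate is interpolated, it is only as precise as the bucket boundaries, which is one reason to place boundaries at the alert thresholds themselves.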
Error Rate Monitoring
- Alert when the 5xx error rate > 1%
- Alert when the 4xx error rate > 5%
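# Flask application monitoring middleware
import time

from flask import Flask, g
from prometheus_client import Counter, Histogram

app = Flask(__name__)

# Buckets in seconds, mirroring the 50 ms to 1000 ms boundaries
response_time = Histogram('model_response_seconds', 'Response time',
                          buckets=[0.05, 0.1, 0.2, 0.5, 1.0])
# The HighErrorRate alert rule divides by model_requests_total
request_count = Counter('model_requests_total', 'Total requests')
error_count = Counter('model_errors_total', 'Total errors')

@app.before_request
def start_timer():
    g.start_time = time.time()

@app.after_request
def record_metrics(response):
    # Record latency, then count the request and any 5xx response
    response_time.observe(time.time() - g.start_time)
    request_count.inc()
    if response.status_code >= 500:
        error_count.inc()
    return response  # after_request handlers must return the response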
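The middleware above counts only 5xx responses, so checking the 4xx threshold as well would need a separate counter or a status label. As a rough, library-free sketch of the threshold logic itself, the following computes both rates from a batch of status codes. The names error_rates and breached_thresholds are hypothetical.

```python
from collections import Counter

# Thresholds from the error-rate list, expressed as fractions
THRESHOLD_5XX = 0.01
THRESHOLD_4XX = 0.05


def error_rates(status_codes):
    """Return (4xx rate, 5xx rate) for an iterable of HTTP status codes."""
    classes = Counter(code // 100 for code in status_codes)
    total = sum(classes.values())
    if total == 0:
        return 0.0, 0.0
    return classes[4] / total, classes[5] / total


def breached_thresholds(status_codes):
    """Return the list of error classes whose rate exceeds its threshold."""
    rate_4xx, rate_5xx = error_rates(status_codes)
    alerts = []
    if rate_5xx > THRESHOLD_5XX:
        alerts.append("5xx")
    if rate_4xx > THRESHOLD_4XX:
        alerts.append("4xx")
    return alerts


# Hypothetical traffic samples
healthy = [200] * 970 + [404] * 20 + [500] * 10
degraded = [200] * 900 + [400] * 60 + [503] * 40
```

Here breached_thresholds(healthy) is empty because a 1% rate does not strictly exceed the 1% threshold, while the degraded sample breaches both.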
Throughput Monitoring
- Alert when QPS < 100 requests/second
- Alert when concurrent requests > 1000
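One way to obtain these two numbers inside the service is a sliding-window request counter plus an in-flight gauge. The sketch below is an assumption about how this could be done, not part of the original setup. ThroughputTracker, throughput_alerts, and the 10-second window are hypothetical choices.

```python
import threading
from collections import deque


class ThroughputTracker:
    """Tracks requests per second over a sliding window and in-flight requests."""

    def __init__(self, window_seconds=10.0):
        self.window_seconds = window_seconds
        self._timestamps = deque()
        self._in_flight = 0
        self._lock = threading.Lock()

    def request_started(self, now):
        with self._lock:
            self._timestamps.append(now)
            self._in_flight += 1

    def request_finished(self):
        with self._lock:
            self._in_flight -= 1

    def qps(self, now):
        """Average requests per second over the last window_seconds."""
        with self._lock:
            cutoff = now - self.window_seconds
            # Drop timestamps that have fallen out of the window
            while self._timestamps and self._timestamps[0] <= cutoff:
                self._timestamps.popleft()
            return len(self._timestamps) / self.window_seconds

    def in_flight(self):
        with self._lock:
            return self._in_flight


def throughput_alerts(qps, in_flight, min_qps=100, max_in_flight=1000):
    """Apply the two throughput thresholds and return the breached ones."""
    alerts = []
    if qps < min_qps:
        alerts.append("low_qps")
    if in_flight > max_in_flight:
        alerts.append("high_concurrency")
    return alerts
```

In a Flask service, request_started and request_finished would be called from the before_request and after_request hooks, passing time.time() as now. Taking the timestamp as a parameter keeps the class easy to test with synthetic clocks.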
Alerting Configuration
Prometheus Alerting Rules
# alert.rules.yml
groups:
  - name: model_availability
    rules:
      - alert: HighErrorRate
        expr: rate(model_errors_total[5m]) / rate(model_requests_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate, currently {{ $value }}"
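The for: 2m clause means the expression has to stay true for two minutes before the alert fires. Until then the alert is only pending. The following sketch models that behavior in plain Python as an illustration. AlertRule and error_ratio are hypothetical names, and the evaluation times are synthetic.

```python
import math


def error_ratio(errors_per_second, requests_per_second):
    """Ratio of the two rates; NaN when there is no traffic."""
    if requests_per_second == 0:
        return math.nan
    return errors_per_second / requests_per_second


class AlertRule:
    """Models the inactive -> pending -> firing lifecycle of a rule with `for`."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.state = "inactive"
        self._active_since = None

    def evaluate(self, value, now):
        # A NaN value compares False, so no traffic means no alert
        if value > self.threshold:
            if self._active_since is None:
                self._active_since = now
            if now - self._active_since >= self.for_seconds:
                self.state = "firing"
            else:
                self.state = "pending"
        else:
            # Any evaluation below the threshold resets the timer
            self._active_since = None
            self.state = "inactive"
        return self.state


rule = AlertRule(threshold=0.01, for_seconds=120)
states = [
    rule.evaluate(error_ratio(2.0, 100.0), now=0),
    rule.evaluate(error_ratio(3.0, 100.0), now=60),
    rule.evaluate(error_ratio(2.0, 100.0), now=120),
    rule.evaluate(error_ratio(0.5, 100.0), now=180),
]
```

A short spike that clears within two minutes never leaves the pending state, which is the point of the for clause: it trades a little detection latency for fewer noisy pages.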
Slack Notification Configuration
{
  "channel": "#model-alerts",
  "username": "Model Monitor",
  "icon_emoji": ":robot_face:",
  "text": "⚠️ Model service anomaly: P95 response time exceeded the threshold"
}
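A payload like this is typically sent as a JSON POST to a Slack incoming webhook. The sketch below uses only the standard library. The helper names build_slack_payload and send_slack_alert are hypothetical, the webhook URL has to come from your own Slack setup, and some webhook types may ignore the channel, username, and icon_emoji overrides.

```python
import json
import urllib.request


def build_slack_payload(metric, value, threshold, channel="#model-alerts"):
    """Build a message payload with the same fields as the configuration."""
    return {
        "channel": channel,
        "username": "Model Monitor",
        "icon_emoji": ":robot_face:",
        "text": (
            f":warning: Model service anomaly: {metric} is {value}, "
            f"above the threshold of {threshold}"
        ),
    }


def send_slack_alert(webhook_url, payload, timeout=5.0):
    """POST the payload as JSON to the webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_slack_payload("P95 response time", "350ms", "200ms")
# send_slack_alert(webhook_url, payload)  # webhook_url: supplied by your Slack workspace
```

When Prometheus Alertmanager is already in place, its built-in Slack receiver is usually a simpler route than posting from application code. The sketch is mainly useful for ad hoc scripts and for testing the message format.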
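Automatic Recovery Detection
- Clear the alert automatically once response time returns to normal
- Resolve the alert automatically after 5 consecutive minutes with no errors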
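The second rule can be sketched as a small state holder that resolves a firing alert only after an uninterrupted error-free period. This is an illustrative assumption about the logic, not a description of a specific tool. RecoveryDetector and its method names are hypothetical.

```python
class RecoveryDetector:
    """Resolves a firing alert after a continuous period with no errors."""

    def __init__(self, quiet_seconds=300.0):
        self.quiet_seconds = quiet_seconds
        self.firing = False
        self._quiet_since = None

    def fire(self):
        """Mark the alert as firing and restart the quiet-period clock."""
        self.firing = True
        self._quiet_since = None

    def observe(self, errors_in_interval, now):
        """Feed one interval's error count; return True if the alert just resolved."""
        if not self.firing:
            return False
        if errors_in_interval > 0:
            # Any error interrupts the quiet period
            self._quiet_since = None
            return False
        if self._quiet_since is None:
            self._quiet_since = now
        if now - self._quiet_since >= self.quiet_seconds:
            self.firing = False
            self._quiet_since = None
            return True
        return False


detector = RecoveryDetector(quiet_seconds=300.0)
detector.fire()
timeline = [
    detector.observe(3, now=0),
    detector.observe(0, now=60),
    detector.observe(0, now=240),
    detector.observe(1, now=300),   # error resets the quiet period
    detector.observe(0, now=360),
    detector.observe(0, now=600),
    detector.observe(0, now=660),   # 300 s after the quiet period restarted
]
```

Requiring the full quiet period, instead of resolving on the first clean interval, avoids alerts that flap between firing and resolved while the service is still unstable.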
