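Stability Monitoring for Deployed Machine Learning Models
Core Monitoring Metric Configuration
Model performance metrics:
- Accuracy: target of 0.95; alert when it falls below 0.92
- AUC: monitor roc_auc_score with a target of 0.90; alert when it falls below 0.85
- Inference latency: alert when the average response time exceeds 100 ms
- Prediction sample volume: alert when the number of samples processed per hour falls below 30% of the expected value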
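As a rough illustration of how the alert thresholds above could be evaluated in one place, here is a minimal sketch using only the standard library. The `Snapshot` class, the constant names, and `breached_metrics` are hypothetical names introduced for this example; the sample-volume rule assumes the reading "below 30% of the expected hourly volume".

```python
from dataclasses import dataclass
from typing import List

# Alert thresholds taken from the metric list; the constant names are illustrative.
ACCURACY_ALERT = 0.92
AUC_ALERT = 0.85
LATENCY_ALERT_MS = 100.0
VOLUME_RATIO_ALERT = 0.30  # assumption: alert below 30% of the expected volume


@dataclass
class Snapshot:
    """One reading of the monitored metrics (hypothetical container)."""
    accuracy: float
    auc: float
    avg_latency_ms: float
    samples_last_hour: int
    expected_samples_per_hour: int


def breached_metrics(s: Snapshot) -> List[str]:
    """Return the names of the metrics that cross their alert threshold."""
    breached = []
    if s.accuracy < ACCURACY_ALERT:
        breached.append("accuracy")
    if s.auc < AUC_ALERT:
        breached.append("auc")
    if s.avg_latency_ms > LATENCY_ALERT_MS:
        breached.append("latency")
    if s.samples_last_hour < VOLUME_RATIO_ALERT * s.expected_samples_per_hour:
        breached.append("sample_volume")
    return breached
```

Keeping the thresholds as named constants makes it easier to keep them consistent with whatever alerting rules are configured elsewhere.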
Implementation Steps
- Create a monitoring script:
import logging
from prometheus_client import Histogram, Gauge

class ModelMonitor:
    def __init__(self):
        # Gauge: the latest measured accuracy, a value that can go up or down
        self.accuracy_gauge = Gauge('model_accuracy', 'Current model accuracy')
        # Histogram: distribution of inference latency, observed in seconds
        self.latency_histogram = Histogram('model_latency_seconds', 'Model inference latency')

    def record_metrics(self, accuracy, latency):
        self.accuracy_gauge.set(accuracy)
        self.latency_histogram.observe(latency)
- Configure alert rules: add the following rules to Prometheus:
groups:
  - name: model_alerts
    rules:
      - alert: ModelAccuracyDrop
        expr: model_accuracy < 0.92
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy dropped below threshold"
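- Integrate into CI/CD: add the commands that start the monitoring service to the deployment script:
kubectl apply -f prometheus-deployment.yaml
# port-forward runs in the foreground until interrupted; it is meant for local access to the Prometheus UI
kubectl port-forward svc/prometheus 9090:9090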
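The monitoring script needs an accuracy value and a latency value to record. One possible way to produce them is sketched below, assuming ground-truth labels arrive some time after prediction. `RollingAccuracy` and `timed_call` are hypothetical helpers, not part of prometheus_client, and the window size is an arbitrary choice.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Tuple


class RollingAccuracy:
    """Accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window: int = 1000) -> None:
        # deque(maxlen=...) silently drops the oldest entry once it is full
        self._hits: Deque[bool] = deque(maxlen=window)

    def update(self, predicted: Any, actual: Any) -> None:
        self._hits.append(predicted == actual)

    def value(self) -> float:
        # No labelled samples yet: report NaN rather than a made-up accuracy
        if not self._hits:
            return float("nan")
        return sum(self._hits) / len(self._hits)


def timed_call(predict: Callable[[Any], Any], features: Any) -> Tuple[Any, float]:
    """Run predict(features) and return (prediction, latency in seconds)."""
    start = time.perf_counter()
    prediction = predict(features)
    return prediction, time.perf_counter() - start
```

With these helpers, the two values could be passed to `ModelMonitor.record_metrics(tracker.value(), latency)`. The latency is in seconds, matching the `model_latency_seconds` metric name, so the 100 ms threshold corresponds to 0.1.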
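Alerting Configuration Scheme
- Level 1 alert: accuracy < 0.92 or latency > 100 ms; notify immediately
- Level 2 alert: accuracy < 0.95 or latency > 50 ms, not recovered within 30 minutes
- Level 3 alert: sample volume down by 50%; checked once per hour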
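The three levels above could be expressed as a single classification function, sketched here under some assumptions: the conditions within a level are combined with "or", `minutes_degraded` is how long the level 2 condition has held continuously, and `volume_drop` is the fractional drop relative to the expected sample volume. The function name and parameters are hypothetical, and the hourly scheduling of the level 3 check is left to the caller.

```python
from typing import Optional


def alert_level(accuracy: float, latency_ms: float,
                minutes_degraded: float, volume_drop: float) -> Optional[int]:
    """Return the most severe matching alert level (1 is highest), or None."""
    # Level 1: hard thresholds, notify immediately
    if accuracy < 0.92 or latency_ms > 100:
        return 1
    # Level 2: softer thresholds, only once the degradation has lasted 30 minutes
    if (accuracy < 0.95 or latency_ms > 50) and minutes_degraded >= 30:
        return 2
    # Level 3: sample volume has dropped by at least half
    if volume_drop >= 0.5:
        return 3
    return None
```

Checking the levels from most to least severe means a single reading never produces more than one notification.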
