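Post-Deployment Performance Evaluation of Machine Learning Models
Core Monitoring Metric Configuration
Once the model is live, monitor the following key metrics: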
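1. Accuracy metrics
- model_accuracy: overall accuracy, threshold 0.95
- precision_score: precision, threshold 0.90
- recall_score: recall, threshold 0.85
2. Performance metrics
- model_latency: average response time, alert when it exceeds 500 ms
- throughput: requests processed per second, alert when it falls below 100 TPS
- memory_usage: memory utilization, alert when it exceeds 80%
3. Data quality metrics
- data_drift_score: data drift detection score, threshold 0.3
- model_drift_score: model drift score, threshold 0.25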
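The thresholds above can be collected into a single lookup table and checked in one pass. The following is a minimal sketch, not a definitive implementation: the `Threshold` class and `find_violations` function are hypothetical names, and the alert direction for precision, recall, and the two drift scores is an assumption (quality scores alert when they fall below the threshold, drift scores when they rise above it), since the list only states the threshold values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    limit: float
    alert_when: str  # "below" or "above"


# Threshold values copied from the metric list; the direction for precision,
# recall, and the drift scores is an assumption (see the text above).
THRESHOLDS = {
    "model_accuracy": Threshold(0.95, "below"),
    "precision_score": Threshold(0.90, "below"),
    "recall_score": Threshold(0.85, "below"),
    "model_latency": Threshold(500, "above"),      # milliseconds
    "throughput": Threshold(100, "below"),         # requests per second
    "memory_usage": Threshold(80, "above"),        # percent
    "data_drift_score": Threshold(0.3, "above"),
    "model_drift_score": Threshold(0.25, "above"),
}


def find_violations(observed: dict[str, float]) -> list[str]:
    """Return the names of metrics whose observed value breaches its threshold."""
    violations = []
    for name, value in observed.items():
        threshold = THRESHOLDS.get(name)
        if threshold is None:
            continue  # metric is not covered by the threshold table
        if threshold.alert_when == "below" and value < threshold.limit:
            violations.append(name)
        elif threshold.alert_when == "above" and value > threshold.limit:
            violations.append(name)
    return violations
```

Calling `find_violations({"model_accuracy": 0.93, "model_latency": 620})` would report both metrics, while a value exactly equal to its threshold is not treated as a violation.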
Alerting Configuration
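# Prometheus alerting rules (YAML rule-file format used by Prometheus 2.0 and later)
groups:
  - name: model_monitoring  # group name is arbitrary
    rules:
      - alert: ModelPerformanceDegradation
        expr: model_accuracy < 0.95
        for: 5m
        annotations:
          summary: "Model accuracy dropped to {{ $value }}"
          description: "Current accuracy is {{ $value }}, below the configured threshold of 0.95"
      - alert: HighLatency
        expr: model_latency > 500
        for: 2m
        annotations:
          summary: "Response time exceeded 500 ms"
          description: "Model response time reached {{ $value }} ms"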
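The 5-minute and 2-minute hold durations in the rules above mean an alert fires only when the condition stays true across consecutive evaluations for that long, which filters out brief dips. The sketch below is a simplified model of that behavior for a "value below threshold" rule, not Prometheus's actual implementation; `first_firing_time` is a hypothetical helper name.

```python
from typing import Iterable, Optional


def first_firing_time(
    samples: Iterable[tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> Optional[float]:
    """Return the timestamp at which a 'value < threshold' rule with the given
    hold duration would start firing, or None if it never fires.

    samples: (timestamp_seconds, value) pairs in increasing time order,
    one pair per rule evaluation.
    """
    pending_since = None
    for timestamp, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true: pending
            if timestamp - pending_since >= hold_seconds:
                return timestamp  # condition held long enough: firing
        else:
            pending_since = None  # any healthy evaluation resets the timer
    return None
```

With one evaluation per minute, an accuracy dip lasting two minutes never fires under a 300-second hold, whereas a dip that persists fires on the evaluation five minutes after it began.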
Reproducible Evaluation Steps
- Deploy the Prometheus monitoring service:
docker run -d --name prometheus -p 9090:9090 prom/prometheus
- Configure the model metrics collector using the following code:
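from prometheus_client import Gauge, start_http_server

accuracy_gauge = Gauge('model_accuracy', 'Current model accuracy')
latency_gauge = Gauge('model_latency', 'Average response time in milliseconds')

# Expose metrics for Prometheus to scrape (port 8000 is an assumed example value)
start_http_server(8000)

# Update the metric values (placeholders; supply these from your own evaluation code)
current_accuracy = 0.96
current_latency = 120.0
accuracy_gauge.set(current_accuracy)
latency_gauge.set(current_latency)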
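- Create an alert notification by posting a test alert to Alertmanager:
curl -X POST -H "Content-Type: application/json" -d '[{"labels": {"alertname": "ModelPerformanceDegradation"}}]' http://localhost:9093/api/v2/alerts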
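The alert notification step can also be scripted. The sketch below assumes Alertmanager is listening on localhost:9093 and serves the v2 endpoint `/api/v2/alerts` (older Alertmanager releases exposed `/api/v1/alerts`). `build_alert` and `send_alert` are hypothetical helper names, and the `severity` label is an illustrative convention, not something the steps above require.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Assumption: Alertmanager is reachable at this address and serves the v2 API.
ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"


def build_alert(name: str, summary: str, severity: str = "warning") -> list[dict]:
    """Build a one-element alert list in the JSON shape the v2 alerts endpoint accepts."""
    return [{
        "labels": {"alertname": name, "severity": severity},
        "annotations": {"summary": summary},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def send_alert(alerts: list[dict], url: str = ALERTMANAGER_URL) -> int:
    """POST the alerts as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

With a running Alertmanager, `send_alert(build_alert("ModelPerformanceDegradation", "Model accuracy dropped to 0.93"))` would submit one test alert; nothing is sent until `send_alert` is called.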

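The collector step sets the gauges from `current_accuracy` and `current_latency` without showing where those values come from. One possible way to produce them, sketched here with the standard library only, is a rolling window over recent predictions. `RollingMetrics` and `timed_predict` are hypothetical names, the window size of 1000 is an arbitrary example, and the approach assumes ground-truth labels are available for the recorded predictions.

```python
import time
from collections import deque


class RollingMetrics:
    """Track accuracy and mean latency over the most recent `window` predictions."""

    def __init__(self, window: int = 1000):
        self._correct = deque(maxlen=window)     # 1 if prediction matched label, else 0
        self._latency_ms = deque(maxlen=window)  # per-request latency in milliseconds

    def record(self, predicted, actual, latency_ms: float) -> None:
        self._correct.append(1 if predicted == actual else 0)
        self._latency_ms.append(latency_ms)

    @property
    def accuracy(self) -> float:
        # NaN while empty, so an idle model is not reported as 0% accurate
        if not self._correct:
            return float("nan")
        return sum(self._correct) / len(self._correct)

    @property
    def mean_latency_ms(self) -> float:
        if not self._latency_ms:
            return float("nan")
        return sum(self._latency_ms) / len(self._latency_ms)


def timed_predict(predict, features):
    """Call predict(features) and return (prediction, elapsed milliseconds)."""
    start = time.perf_counter()
    prediction = predict(features)
    return prediction, (time.perf_counter() - start) * 1000.0
```

The `accuracy` and `mean_latency_ms` properties can then be passed to `accuracy_gauge.set(...)` and `latency_gauge.set(...)` on whatever schedule suits the service.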
Discussion