Collecting Model-Serving Metrics with Prometheus
Background
In a production machine-learning system, model performance and stability need to be monitored in real time. This article shows how to collect the core metrics of a model service with Prometheus.
Core Metric Definitions
import prometheus_client as pc
from prometheus_client import Gauge, Counter, Histogram
import time
import logging

# Inference latency
model_latency = Histogram('model_inference_seconds', 'Model inference latency', buckets=[0.1, 0.5, 1.0, 2.0, 5.0])
# Request counters (note: prometheus_client exposes Counters with a _total suffix)
model_requests_total = Counter('model_requests_total', 'Total model requests')
model_requests_failed = Counter('model_requests_failed', 'Failed model requests')
# Memory usage
model_memory_usage = Gauge('model_memory_bytes', 'Memory usage of model service')
# CPU usage
model_cpu_percent = Gauge('model_cpu_percent', 'CPU usage percentage')
# Prediction accuracy
model_accuracy = Gauge('model_accuracy_rate', 'Model prediction accuracy rate')
# Service health
model_health_status = Gauge('model_health_status', 'Model service health status (0=unhealthy, 1=healthy)')
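Before wiring these objects into a service, they can be exercised in isolation. A minimal sketch, using a private CollectorRegistry and hypothetical demo_* metric names so it does not clash with the definitions above:

```python
import time
from prometheus_client import CollectorRegistry, Counter, Histogram

# A throwaway registry so the demo does not pollute the global default registry
registry = CollectorRegistry()
latency = Histogram('demo_inference_seconds', 'Demo inference latency',
                    buckets=[0.1, 0.5, 1.0], registry=registry)
requests = Counter('demo_requests_total', 'Demo request count', registry=registry)

requests.inc()
with latency.time():      # records elapsed wall-clock time when the block exits
    time.sleep(0.01)      # stand-in for real inference work

# get_sample_value is handy for unit-testing instrumented code
print(registry.get_sample_value('demo_requests_total'))           # 1.0
print(registry.get_sample_value('demo_inference_seconds_count'))  # 1.0
```

The `Histogram.time()` context manager saves manually subtracting timestamps and is the idiomatic way to record latencies with prometheus_client.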
Collection Script
# Collection loop; predict_model, input_data, expected_output,
# calculate_accuracy, get_memory_usage and get_cpu_usage are
# placeholders for your own inference and system-stat code.
def collect_metrics():
    while True:
        model_requests_total.inc()  # count every request
        try:
            start_time = time.time()
            result = predict_model(input_data)  # simulated inference
            model_latency.observe(time.time() - start_time)  # record latency
            model_accuracy.set(calculate_accuracy(result, expected_output))
            model_memory_usage.set(get_memory_usage())
            model_cpu_percent.set(get_cpu_usage())
            model_health_status.set(1)  # mark healthy on success
        except Exception as e:
            model_requests_failed.inc()
            logging.error(f'Model prediction failed: {e}')
            model_health_status.set(0)
        time.sleep(1)  # collect once per second

if __name__ == '__main__':
    # start_http_server is a plain function, not a decorator:
    # it serves /metrics on port 8000 from a background thread
    pc.start_http_server(8000)
    collect_metrics()
Prometheus Configuration
scrape_configs:
  - job_name: 'model_service'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: /metrics
    scrape_interval: 1s
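Once Prometheus is scraping the endpoint, latency percentiles can be derived from the Histogram's buckets. A sketch (prometheus_client exposes Histogram buckets under a _bucket suffix):

```promql
# 99th-percentile inference latency over the last 5 minutes
histogram_quantile(0.99, rate(model_inference_seconds_bucket[5m]))
```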
Alerting Rules
Alerting rules are evaluated by Prometheus itself; Alertmanager then handles routing and notification. Configure alerts for the following conditions:
- inference latency above 2 seconds
- memory usage above 85%
- request success rate below 90%
- CPU usage sustained above 95%
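The thresholds above can be sketched as a Prometheus rules file. This assumes an 8 GiB memory budget for the 85% threshold (substitute your service's actual limit) and the metric names defined earlier; note that on the wire model_requests_failed becomes model_requests_failed_total:

```yaml
groups:
  - name: model_service_alerts
    rules:
      - alert: HighInferenceLatency
        expr: histogram_quantile(0.99, rate(model_inference_seconds_bucket[5m])) > 2
        for: 2m
      - alert: HighMemoryUsage
        # 85% of an assumed 8 GiB budget; replace with your own limit
        expr: model_memory_bytes > 0.85 * 8 * 1024 * 1024 * 1024
        for: 5m
      - alert: LowSuccessRate
        expr: (1 - rate(model_requests_failed_total[5m]) / rate(model_requests_total[5m])) < 0.90
        for: 5m
      - alert: HighCpuUsage
        expr: model_cpu_percent > 95
        for: 10m
```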

Discussion