Throughput Monitoring for Model Serving Requests
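Metric Definitions
For a model service, we focus on the following throughput-related metrics:
- QPS (Queries Per Second): the number of requests received per second
- P95 latency: the response time within which 95% of requests complete
- Error rate: the fraction of requests that fail
- Concurrent requests: the number of requests being processed at the same time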
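To make these definitions concrete, the sketch below computes each metric from a small in-memory request log. The `RequestRecord` type, the `summarize` helper, and the nearest-rank percentile method are illustrative assumptions and not part of the monitoring setup described here; Prometheus estimates quantiles from histogram buckets rather than from raw samples.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start: float     # request start time, in seconds
    duration: float  # response time, in seconds
    ok: bool         # False if the request failed


def summarize(records, window_seconds):
    """Compute QPS, P95 latency, error rate, and peak concurrency for one window."""
    if not records:
        raise ValueError("records must not be empty")
    total = len(records)
    qps = total / window_seconds

    # Nearest-rank P95: the smallest latency such that at least 95% of
    # requests are at or below it. Integer math avoids float rounding.
    latencies = sorted(r.duration for r in records)
    rank = (95 * total + 99) // 100  # ceil(0.95 * total)
    p95 = latencies[rank - 1]

    error_rate = sum(1 for r in records if not r.ok) / total

    # Sweep over start (+1) and end (-1) events to find the peak in flight.
    events = []
    for r in records:
        events.append((r.start, 1))
        events.append((r.start + r.duration, -1))
    events.sort()  # at equal timestamps, ends (-1) sort before starts (+1)
    in_flight = peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)

    return {"qps": qps, "p95": p95, "error_rate": error_rate,
            "peak_concurrency": peak}


# Hypothetical sample: 20 requests in a 10-second window, one slow failure.
sample = [RequestRecord(start=i * 0.5, duration=0.7, ok=True) for i in range(19)]
sample.append(RequestRecord(start=9.5, duration=3.0, ok=False))
print(summarize(sample, window_seconds=10))
```

Note that a single slow request out of 20 does not move P95 in this sample, which is why P95 is often paired with an error-rate or maximum-latency signal.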
Prometheus Scrape Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'model-service'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

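Example Code for Exposing Metrics

import time

from flask import Flask, Response, jsonify, request
from prometheus_client import (
    CONTENT_TYPE_LATEST, Counter, Gauge, Histogram, generate_latest,
)

app = Flask(__name__)

request_count = Counter('model_requests_total', 'Total requests', ['endpoint'])
request_duration = Histogram('model_request_duration_seconds', 'Request duration')
active_requests = Gauge('model_active_requests', 'Active requests')

# `model` is assumed to be loaded elsewhere at startup and to expose predict().

@app.route('/predict', methods=['POST'])
def predict():
    start_time = time.time()
    active_requests.inc()
    # Count every request, including failed ones, so QPS reflects real load.
    request_count.labels(endpoint='/predict').inc()
    try:
        # Model inference logic; the result is assumed to be JSON-serializable.
        result = model.predict(request.json)
        return jsonify(result)
    finally:
        # Record latency and decrement the gauge even if prediction raises.
        request_duration.observe(time.time() - start_time)
        active_requests.dec()

@app.route('/metrics')
def metrics():
    # Endpoint scraped by Prometheus; matches metrics_path in prometheus.yml.
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    # Port matches the scrape target localhost:8080.
    app.run(port=8080)
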
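Alerting Rules
Create an alert rules file, model-alerts.yml, and list it under rule_files in prometheus.yml so that Prometheus loads it:

groups:
  - name: model-alerts
    rules:
      - alert: HighQPS
        # sum() aggregates across label values so the threshold applies to total QPS.
        expr: sum(rate(model_requests_total[5m])) > 1000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Model service QPS is too high"
          description: "Requests per second exceeded 1000; current value: {{ $value }}"
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(model_request_duration_seconds_bucket[5m])) by (le)) > 2
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Model service latency is too high"
          description: "P95 request latency exceeded 2 seconds"
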
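The HighLatency rule relies on histogram_quantile, which estimates a quantile from cumulative bucket counts rather than from raw latencies. The sketch below mirrors that estimate in plain Python, assuming linear interpolation inside the matching bucket and positive bucket bounds (as is the case for latencies). The function name is borrowed for readability, and the bucket counts are made-up sample data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the last upper bound equal to math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    if len(buckets) < 2 or buckets[-1][0] != math.inf:
        raise ValueError("need at least two buckets, the last being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)

    if idx == len(buckets) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, count = buckets[idx]
    lower, prev_count = (0.0, 0) if idx == 0 else buckets[idx - 1]
    if count == prev_count:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical cumulative counts for buckets le=0.5, 1.0, 2.5, 5.0, +Inf.
buckets = [(0.5, 50), (1.0, 80), (2.5, 96), (5.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"estimated P95: {p95:.3f}s, alert condition met: {p95 > 2}")
```

With these sample counts the estimate is roughly 2.41 seconds, which would satisfy the `> 2` condition. Because the result is interpolated, its accuracy depends on how finely the bucket boundaries are spaced around the alert threshold.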
Alert Integration
Configure an Alertmanager receiver to push alerts to a DingTalk group:

route:
  receiver: 'dingtalk'
  routes:
    - match:
        alertname: "HighQPS"
      receiver: "dingtalk"
receivers:
  - name: "dingtalk"
    webhook_configs:
      - url: "https://oapi.dingtalk.com/robot/send?access_token=your-token"

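One caveat worth checking before relying on this receiver: as far as I understand, Alertmanager's generic webhook posts its own JSON payload, while a DingTalk custom robot expects a message body such as a `text` message, so a small relay usually sits between the two. The sketch below shows only the conversion step. The `to_dingtalk_text` name and the sample payload are hypothetical, and the field names (`alerts`, `labels`, `annotations`, `msgtype`) follow my reading of the two formats and should be verified against the official documentation.

```python
import json


def to_dingtalk_text(payload):
    """Convert an Alertmanager webhook payload into a DingTalk text message."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} ({severity}): {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                summary=annotations.get("summary", ""),
            )
        )
    content = "\n".join(lines) if lines else "Alertmanager notification with no alerts"
    return {"msgtype": "text", "text": {"content": content}}


# Hypothetical payload shaped like an Alertmanager webhook notification.
sample_payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighQPS", "severity": "critical"},
            "annotations": {"summary": "Model service QPS is too high"},
        }
    ],
}
message = to_dingtalk_text(sample_payload)
print(json.dumps(message, ensure_ascii=False, indent=2))
```

A relay built around this function would accept Alertmanager's POST, call the converter, and forward the result as JSON to the robot URL; the HTTP handling is omitted here to keep the sketch short.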
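With the configuration above, the throughput of the model service is monitored in real time and alerts fire automatically. We recommend reviewing metric trends once an hour to confirm that the system remains stable.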
