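Model Monitoring Container Configuration with Docker Compose
Monitoring Metrics Configuration
Define three core containers in docker-compose.yml: the model service, Prometheus for monitoring, and Grafana for visualization.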
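version: '3.8'
services:
  model-service:
    image: my-model:latest
    container_name: model-api
    ports:
      - "5000:5000"
    environment:
      - MODEL_METRICS_PORT=9090
    # The model exports Prometheus metrics at /metrics on port 9090.
    # Compose has no service-level "metrics" key, so the port is only
    # exposed on the Compose network for Prometheus to scrape.
    expose:
      - "9090"
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus-server
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # The rule file referenced by rule_files must also be mounted.
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana-dashboard
    ports:
      - "3000:3000"
    depends_on:
      - prometheus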
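Grafana starts without any data source, so the Prometheus container has to be registered before dashboards can query it. One way to do this is a provisioning file; the sketch below assumes a hypothetical file name, grafana-datasource.yml, mounted into Grafana's default provisioning directory, /etc/grafana/provisioning/datasources/.

```yaml
# grafana-datasource.yml (hypothetical file name, not part of the original setup)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Service name and port taken from docker-compose.yml
    url: http://prometheus:9090
    isDefault: true
```

To use it, the grafana service would need an extra volume entry such as ./grafana-datasource.yml:/etc/grafana/provisioning/datasources/datasource.yml. The same data source can also be added by hand in the Grafana UI.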
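Key Monitoring Metrics
- Model response time: model_response_time_seconds{quantile="0.95"}
- Model error rate (errors per second): rate(model_errors_total[5m])
- Memory usage: process_resident_memory_bytes
- CPU usage: rate(container_cpu_usage_seconds_total[5m]) (this metric is reported by cAdvisor, which is not among the three containers defined here)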
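The expression rate(model_errors_total[5m]) yields errors per second, not a percentage. If the model service also exported a request counter, the fraction of failed requests could be precomputed with a recording rule. The sketch below assumes such a counter exists under the hypothetical name model_requests_total; neither that counter nor the rule name is part of the original setup.

```yaml
groups:
  - name: model-recording-rules
    rules:
      # Fraction of requests that failed over the last 5 minutes.
      # model_requests_total is a hypothetical counter (assumption).
      - record: model:error_ratio:rate5m
        expr: |
          sum(rate(model_errors_total[5m]))
            /
          sum(rate(model_requests_total[5m]))
```

With this rule loaded, a condition such as model:error_ratio:rate5m > 0.1 would mean that more than 10% of requests failed.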
Alerting Configuration
In prometheus.yml, reference the alert rule file and the Alertmanager target:
rule_files:
  - "alert.rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
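The fragment above only covers alerting. Prometheus also needs a scrape job before it collects anything from the model service. A minimal sketch, assuming the service is reachable under its Compose service name model-service and serves metrics at /metrics on port 9090 as configured earlier; the job name and the 15-second intervals are illustrative choices:

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped (illustrative)
  evaluation_interval: 15s  # how often alert rules are evaluated (illustrative)

scrape_configs:
  - job_name: model-service
    metrics_path: /metrics
    static_configs:
      # Compose service name and container-internal metrics port
      - targets: ["model-service:9090"]
```

These sections sit in the same prometheus.yml as rule_files and alerting.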
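Contents of alert.rules.yml:
groups:
  - name: model-alerts
    rules:
      - alert: ModelLatencyHigh
        expr: model_response_time_seconds{quantile="0.95"} > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model response time is too high"
          description: "95th-percentile model response time is above 2 seconds; current value is {{ $value }} seconds"
      - alert: ModelErrorRate
        # rate() returns errors per second, so 0.1 is a per-second threshold, not 10%.
        expr: rate(model_errors_total[1m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Abnormal model error rate"
          description: "Model error rate is above 0.1 errors per second; current value is {{ $value }}"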
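For these alerts to be delivered anywhere, Prometheus must be able to reach an Alertmanager at alertmanager:9093, and docker-compose.yml does not define one. A possible service definition, to be placed under services:, is sketched below. The image prom/alertmanager and its config path are the upstream defaults; the local file name alertmanager.yml is an assumption.

```yaml
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      # Local alertmanager.yml is a hypothetical file (assumption)
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped
```

A minimal alertmanager.yml that accepts alerts without forwarding them could look like this; a real deployment would add an email, webhook, or chat integration to the receiver:

```yaml
route:
  receiver: default
receivers:
  # A receiver with no integrations: alerts appear in the Alertmanager UI only
  - name: default
```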
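Reproduction Steps
- Create docker-compose.yml and the configuration files (prometheus.yml and alert.rules.yml)
- Run docker-compose up -d
- Open http://localhost:3000 to view the Grafana dashboards
- Verify in Prometheus (http://localhost:9090) that the alert rules have loaded
Once the monitoring containers are running, the system collects model metrics automatically and fires an alert whenever a rule's condition holds for the configured duration.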
