Configuring a Model Monitoring System on Docker Swarm
I recently hit quite a few pitfalls deploying a machine-learning model monitoring platform on Docker Swarm, so here is the actual configuration process.
Core Monitoring Metrics Configuration
First, integrate Prometheus monitoring into the Docker Swarm services. Create a docker-compose.yml file:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.37.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
  model-api:
    image: my-model-api:v1.0
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 512M
    # Health check: sits at the service level (not under deploy),
    # and curl must exist inside the image for this test to work
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
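The stack can then be brought up with docker stack deploy -c docker-compose.yml <stack-name>. One note: container-level metrics such as container_memory_usage_bytes (referenced in the pitfalls section below) are not exposed by Docker itself; a common approach is running cAdvisor on every node. A minimal sketch of an extra entry under services:, with the usual host mounts (the image tag is an assumption, pin whatever version you validate):

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    volumes:
      - /:/rootfs:ro                          # host filesystem (read-only)
      - /var/run:/var/run:ro                  # container runtime state
      - /sys:/sys:ro                          # cgroup statistics
      - /var/lib/docker/:/var/lib/docker:ro   # image/layer metadata
    ports:
      - "8080:8080"
    deploy:
      mode: global   # one cAdvisor instance per Swarm node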
Prometheus configuration file, prometheus.yml:
scrape_configs:
  - job_name: 'docker-swarm'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: /metrics
    # Label the model inference latency series. Prometheus relabel regexes are
    # fully anchored, so the trailing .* is needed to also match the
    # _sum/_count/_bucket series used in the alert rule below.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'model_inference_duration_seconds.*'
        target_label: model_type
        replacement: 'production'
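As written, static_configs only scrapes Prometheus itself on localhost:9090, so the model metrics never show up. On Swarm, one option is DNS service discovery: tasks.<service> resolves to one A record per running replica. A sketch of an extra entry under scrape_configs, assuming the model API serves /metrics on port 8000 (the same port as the health check above):

  - job_name: 'model-api'
    dns_sd_configs:
      - names: ['tasks.model-api']  # Swarm DNS: one A record per task
        type: A
        port: 8000                  # assumed metrics port

This requires the prometheus and model-api services to share an overlay network in the stack.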
Alerting Configuration
Alert routing and notification channels are configured in Alertmanager (alertmanager.yml):
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'slack'
receivers:
  - name: 'slack'
    slack_configs:
      # api_url (the Slack webhook) and channel still need to be filled in
      - send_resolved: true
        text: "{{ .CommonAnnotations.description }}"
The alert rule itself belongs in a Prometheus rule file (e.g. alert_rules.yml), not in the Alertmanager config:

groups:
  - name: model-alerts   # any group name works
    rules:
      - alert: ModelLatencyHigh
        expr: rate(model_inference_duration_seconds_sum[5m]) / rate(model_inference_duration_seconds_count[5m]) > 1.0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Model inference latency is too high"
          description: "Average model latency exceeds 1 second; current value: {{ $value }}s"
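For this rule file to actually fire, prometheus.yml also has to load it and know where Alertmanager lives. A minimal sketch, assuming an alertmanager service reachable on the same overlay network:

rule_files:
  - 'alert_rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']  # assumed service name and default port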
Pitfalls Encountered in Practice
- Memory-leak monitoring: at first I only tracked CPU usage, so when a memory leak hit, CPU looked normal while the service crashed. Adding the container_memory_usage_bytes metric solved it (see the rule sketch after this list).
- Alert storms: without a for condition, tiny fluctuations kept firing alerts. Rules now have to hold for at least 2 minutes before triggering.
- Docker event listening: docker events --filter 'event=die' catches abnormal container exits, paired with a Slack alert.
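To illustrate the first item, a hedged sketch of a memory rule that would go into the same rules: list as ModelLatencyHigh; the absolute 1.9e9 threshold (roughly 90% of the 2G limit above) and the name regex are assumptions to adapt:

      - alert: ContainerMemoryHigh
        expr: container_memory_usage_bytes{name=~".*model-api.*"} > 1.9e9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "model-api memory is close to its 2G limit"
          description: "Current usage: {{ $value }} bytes"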
In production, I recommend configuring at least these core metrics: inference latency, memory usage, CPU usage, and service health status.
