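Resource Monitoring for Containerized Applications

To monitor the resource usage of containerized ML applications on Kubernetes, configure Prometheus to scrape metrics and define alerting rules.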

Configuring Monitoring Metrics

First, add resource requests and limits to the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
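    metadata:
      labels:
        app: ml-model
      annotations:
        # Needed by the kubernetes-pods scrape job, which keeps only annotated pods.
        # Assumes the application serves /metrics on port 8080.
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"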
    spec:
      containers:
      - name: model-container
        image: my-ml-model:latest
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        ports:
        - containerPort: 8080

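Prometheus Scrape Configuration

Add a job to prometheus.yml. It keeps only pods annotated with prometheus.io/scrape: "true" and honors the optional path and port annotations: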

scrape_configs:
- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
    action: keep
    regex: true
  - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::[0-9]+)?;([0-9]+)
    replacement: $1:$2
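
The kubernetes-pods job above only collects metrics that the application itself exposes. The container_* series used in the rest of this guide (such as container_memory_usage_bytes) are exported by the kubelet's built-in cAdvisor endpoint, so they need a job of their own. A minimal sketch, modeled on the example configuration shipped with Prometheus; it assumes Prometheus runs inside the cluster with a service account that is allowed to proxy to nodes, and the job name is an arbitrary choice:

```yaml
# Append this entry under the existing scrape_configs list in prometheus.yml.
- job_name: 'kubernetes-cadvisor'   # name is an assumption, pick your own
  scheme: https
  tls_config:
    # In-cluster CA bundle mounted into every pod by Kubernetes
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  authorization:
    # Service account token used to authenticate against the API server
    credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
  - role: node
  relabel_configs:
  # Send every request through the API server instead of hitting nodes directly
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  # Rewrite the path to each node's cAdvisor metrics endpoint
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
```

Going through the API server proxy avoids needing direct network access to each kubelet, at the cost of extra load on the API server in large clusters.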

Alerting Rule Configuration

Create an alerting-rules.yml file:

groups:
- name: container-alerts
  rules:
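  - alert: HighCPUUsage
    # rate() yields CPU cores, not a percentage, so divide by the CPU limit
    # (quota / period) to get utilization relative to the limit.
    expr: |
      sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
        / sum by (namespace, pod, container) (container_spec_cpu_quota{container!=""} / container_spec_cpu_period{container!=""})
        > 0.8
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU usage is too high"
      description: "Container CPU usage is at {{ $value | humanizePercentage }} of its limit, above the 0.8 threshold"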
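  - alert: MemoryExceeded
    # A container limited to 1Gi is OOM-killed before its usage can stay above
    # 1Gi for 5m, so alert on the share of the limit in use instead.
    # The 0.9 threshold is a suggested starting point, not a measured value.
    expr: |
      container_memory_working_set_bytes{container!=""}
        / (container_spec_memory_limit_bytes{container!=""} > 0)
        > 0.9
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Container memory is close to its limit"
      description: "Container memory usage is at {{ $value | humanizePercentage }} of its limit"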
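
Prometheus only evaluates these rules once the file is listed in prometheus.yml. A minimal sketch, assuming alerting-rules.yml sits in the same directory as prometheus.yml; the Alertmanager address is a hypothetical placeholder for wherever yours is reachable:

```yaml
# prometheus.yml (excerpt)
rule_files:
- alerting-rules.yml          # relative paths resolve against this config file's directory

alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanager:9093']   # hypothetical address, replace with your own
```

Running promtool check rules alerting-rules.yml before reloading Prometheus is a cheap way to catch syntax errors in the rule file.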

Verifying the Monitoring Setup

Use the following commands to confirm that metrics are being scraped:

kubectl port-forward svc/prometheus-service 9090:9090
# Open http://localhost:9090
# Enter this in the query box: container_memory_usage_bytes

Nora595 · 2026-01-08T10:24:58

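Don't treat Prometheus as a cure-all. Configuring metrics is only the first step; what really hurts is getting the alert thresholds wrong. I've seen too many teams set a 1G memory limit and then hit OOMKilled over and over in production, all because they never worked out how much their model actually consumes. Run a full load test in a test environment first, then set requests and limits based on the measured usage.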

ColdMouth · 2026-01-08T10:24:58

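Container monitoring can't stop at CPU and memory. For ML applications in particular, I/O bottlenecks are often more damaging than resource exhaustion. I once had a training job hang in the data-loading stage because I wasn't monitoring disk read/write rates. Add disk I/O, network bandwidth, and other key metrics, and don't let Prometheus watch only the handful of standard fields.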

LongVictor · 2026-01-08T10:24:58

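Because of how Kubernetes schedules resources, monitoring for containerized applications has to keep pace. Many teams deploy with nothing but default resource limits, and the resulting contention for node resources cascades into a cluster-wide failure. I recommend profiling the resource usage of every model to establish a baseline, then using HPA for autoscaling instead of relying on people watching dashboards.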
