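Resource Monitoring for Containerized Applications
To monitor the resource usage of containerized ML applications on Kubernetes, configure Prometheus to scrape metrics and define alerting rules.
Metrics Configuration
First, set resource requests and limits in the Deployment: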
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: model-container
          image: my-ml-model:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "500m"
          ports:
            - containerPort: 8080
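The pod scrape job in the next section keeps only pods that carry a prometheus.io/scrape annotation, and the Deployment above does not set one. The fragment below is a minimal sketch of the pod-template annotations that job looks for. It assumes the model server exposes Prometheus metrics on container port 8080 at /metrics; the path is an assumption, since the source does not say where (or whether) the application serves metrics.

```yaml
# Fragment of the Deployment's pod template (spec.template), not a complete manifest.
template:
  metadata:
    labels:
      app: ml-model
    annotations:
      prometheus.io/scrape: "true"    # matched by the "keep" relabel rule
      prometheus.io/path: "/metrics"  # assumed metrics path
      prometheus.io/port: "8080"      # containerPort from the Deployment
```

Annotation values must be strings, hence the quotes. These annotations are a convention rather than a Prometheus feature: they take effect only because the relabel rules read them.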
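Prometheus Scrape Configuration
Add the following job to prometheus.yml. It discovers pods through the Kubernetes API and keeps only those annotated with prometheus.io/scrape: "true". Note that this job scrapes only what the pods themselves expose; container-level metrics come from cAdvisor on the kubelet.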
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the prometheus.io/path annotation as the metrics path, if set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to use the port from the prometheus.io/port annotation.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::[0-9]+)?;([0-9]+)
        replacement: $1:$2
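The following is a sketch of a second job that scrapes the kubelet's cAdvisor endpoint, which is where container_cpu_usage_seconds_total and container_memory_usage_bytes originate. It assumes Prometheus runs inside the cluster with a service account that is allowed to read node metrics. The job name is illustrative. The authorization block requires a reasonably recent Prometheus release; older releases use bearer_token_file instead.

```yaml
scrape_configs:
  # Assumed additional job: scrape cAdvisor metrics from every node's kubelet.
  - job_name: 'kubernetes-cadvisor'
    scheme: https
    metrics_path: /metrics/cadvisor
    tls_config:
      # CA bundle mounted into every pod by Kubernetes.
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    authorization:
      # Service account token used to authenticate against the kubelet.
      credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Copy Kubernetes node labels onto the scraped series.
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
```

Many cluster setups ship an equivalent job already, so it is worth checking the existing configuration before adding this one.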
Alerting Rules
Create a file named alerting-rules.yml:
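groups:
  - name: container-alerts
    rules:
      - alert: HighCPUUsage
        # rate() yields CPU cores; dividing by the 0.5-core (500m) limit
        # turns it into the fraction of the limit in use.
        expr: rate(container_cpu_usage_seconds_total{container="model-container"}[5m]) / 0.5 > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU usage is high"
          description: "Container CPU usage is {{ $value }} of its limit, above the 0.8 threshold"
      - alert: MemoryExceeded
        # 1073741824 bytes = 1 GiB, the memory limit. Usage cannot stay above the
        # limit (the container is OOM-killed first), so alert at an assumed 90% of it.
        expr: container_memory_usage_bytes{container="model-container"} > 0.9 * 1073741824
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Container memory is close to its limit"
          description: "Container memory usage is above 90% of the 1 GiB limit, current value {{ $value }} bytes"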
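Prometheus evaluates these rules only after the file is listed under rule_files in prometheus.yml. A minimal sketch, assuming alerting-rules.yml is stored in the same directory as prometheus.yml:

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m   # how often rules are evaluated; 1m is the Prometheus default
rule_files:
  - alerting-rules.yml      # assumed location: same directory as prometheus.yml
```

With only this in place, firing alerts show up on the Alerts page of the Prometheus UI. Sending notifications additionally requires an Alertmanager, which this section does not cover.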
Verifying the Setup
Use the following command to confirm that metrics are being scraped:
kubectl port-forward svc/prometheus-service 9090:9090
# Open http://localhost:9090
# Enter this query in the expression box: container_memory_usage_bytes
