Introduction
In a modern microservice architecture, the complexity and distributed nature of the system make monitoring essential. As the number of services grows and business logic becomes more involved, traditional monitoring approaches can no longer satisfy the observability needs of modern applications. This article walks through building a complete microservice monitoring stack on Docker and Kubernetes, focusing on integrating the core components Prometheus, Grafana, and Alertmanager.
The Importance of Microservice Monitoring
Why monitor microservices?
A microservice architecture splits a complex monolith into many independent services, each with its own lifecycle and deployment process. This brings flexibility and scalability, but it also creates monitoring challenges:
- Distribution: call chains between services are complex, making fault diagnosis difficult
- Dynamism: service instances scale up and down, so the set of monitoring targets keeps changing
- Scattered data: metrics are spread across many services and must be collected and displayed centrally
- Real-time requirements: service state must be monitored continuously so that problems are detected and handled promptly
Core elements of a monitoring stack
A complete monitoring stack should include the following core elements:
- Metric collection: gather runtime metrics from every service
- Data storage: store the collected metrics reliably
- Visualization: present the monitoring data intuitively
- Alerting: detect anomalies promptly and notify the right people
- Log collection: complement the metrics with full context for troubleshooting
Prometheus Monitoring System Overview
What is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit, originally developed at SoundCloud. It collects metrics using a pull model and provides a powerful query language, PromQL, that can express complex monitoring requirements.
Core features of Prometheus
- Multi-dimensional data model: time series are identified by a metric name plus key/value labels
- PromQL: a powerful query language supporting complex aggregation and computation
- Service discovery: automatic discovery of scrape targets
- Alerting: rule-based alert evaluation
- Easy deployment: a single binary that is simple to run
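To make the pull model concrete, here is a minimal sketch of an application exposing a /metrics endpoint in the Prometheus text exposition format. It uses only the Python standard library, and the metric and function names are invented for illustration; a real service would normally use the official prometheus_client library instead.

```python
# Minimal sketch of Prometheus's pull model: the application publishes current
# metric values at /metrics, and the Prometheus server scrapes that endpoint
# on its own schedule.
import http.server
import threading

REQUEST_COUNT = {"value": 0}  # stand-in counter: here it just counts scrapes

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        REQUEST_COUNT["value"] += 1
        # Prometheus text exposition format: HELP/TYPE comments, then samples
        body = (
            "# HELP app_requests_total Total requests handled.\n"
            "# TYPE app_requests_total counter\n"
            f"app_requests_total {REQUEST_COUNT['value']}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def start_metrics_server(port: int = 0) -> http.server.HTTPServer:
    """Start the /metrics endpoint in a background thread; port 0 = ephemeral."""
    server = http.server.HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Prometheus would then be pointed at this endpoint via a scrape_configs entry, exactly like the jobs shown later in this article.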
Prometheus Architecture
        ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
        │ Targets      │    │ Targets      │    │ Targets      │
        │ (Nodes)      │    │ (Pods)       │    │ (Pods)       │
        └──────┬───────┘    └──────┬───────┘    └──────┬───────┘
               │   service discovery + pull (scrape)   │
               └───────────────────┬───────────────────┘
                                   │
                                   ▼
                      ┌─────────────────────┐
                      │  Prometheus Server  │
                      │                     │
                      │  ┌───────────────┐  │
                      │  │  Rule Engine  │  │
                      │  └───────────────┘  │
                      │  ┌───────────────┐  │
                      │  │ Query Engine  │  │
                      │  └───────────────┘  │
                      │  ┌───────────────┐  │
                      │  │   HTTP API    │  │
                      │  └───────────────┘  │
                      └─────────────────────┘
Kubernetes Environment Preparation
Requirements
Before deploying the monitoring stack, make sure the following environment is in place:

# Check the Kubernetes cluster status
kubectl cluster-info
kubectl get nodes
# Check the Docker version
docker --version
# Check the kubectl version (--short was removed in kubectl v1.28+; use plain "kubectl version" there)
kubectl version --short
Create the monitoring namespace
# monitoring-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring

kubectl apply -f monitoring-namespace.yaml
Deploying Prometheus
Prometheus configuration file
First, create the Prometheus configuration file:
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - "alert.rules"
    scrape_configs:
      # Scrape Prometheus's own metrics
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      # Scrape Kubernetes node metrics
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.*):10250'
            target_label: __address__
            replacement: '${1}:10250'
          - source_labels: [__meta_kubernetes_node_name]
            target_label: node
      # Scrape Kubernetes service endpoints
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
      # Scrape Kubernetes pod metrics
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
Deploy the Prometheus service
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.37.0
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus/'
            - '--web.console.libraries=/etc/prometheus/console_libraries'
            - '--web.console.templates=/etc/prometheus/consoles'
            - '--storage.tsdb.retention.time=24h'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
    - port: 9090
      targetPort: 9090
  type: ClusterIP

kubectl apply -f prometheus-config.yaml
kubectl apply -f prometheus-deployment.yaml
Deploying Grafana
Grafana configuration file
# grafana-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-config
  namespace: monitoring
data:
  grafana.ini: |
    [server]
    domain = localhost
    root_url = %(protocol)s://%(domain)s:%(http_port)s/

    [auth.anonymous]
    enabled = true

    [auth.basic]
    enabled = false

    [database]
    type = sqlite3
    path = /var/lib/grafana/grafana.db

    [paths]
    data = /var/lib/grafana
    logs = /var/log/grafana
    plugins = /var/lib/grafana/plugins
    provisioning = /etc/grafana/provisioning
Deploy the Grafana service
# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:9.4.7
          ports:
            - containerPort: 3000
          volumeMounts:
            - name: config-volume
              mountPath: /etc/grafana
            - name: grafana-storage
              mountPath: /var/lib/grafana
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
      volumes:
        - name: config-volume
          configMap:
            name: grafana-config
        - name: grafana-storage
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
  type: ClusterIP

kubectl apply -f grafana-config.yaml
kubectl apply -f grafana-deployment.yaml
Configuring the Prometheus Data Source
Access the Grafana UI

# Create a port-forward to reach Grafana
kubectl port-forward svc/grafana 3000:3000 -n monitoring

Open http://localhost:3000. Note that the grafana.ini above enables anonymous access, so the UI is viewable without logging in; to sign in explicitly, Grafana's default credentials are admin/admin.
Add the Prometheus data source
To add Prometheus as a data source in Grafana:
- Click "Configuration" in the left-hand menu
- Click "Data Sources"
- Click "Add data source"
- Select "Prometheus"
- Set the data source URL to http://prometheus:9090 (the in-cluster service address)
- Click "Save & Test"
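The same data source can also be declared in a provisioning file instead of clicking through the UI. A sketch, assuming the file is mounted under /etc/grafana/provisioning/datasources/ (the directory the grafana.ini above already points at); the file name itself is arbitrary:

```yaml
# datasources.yaml — Grafana data source provisioning (provisioning apiVersion 1)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # the Grafana backend proxies queries
    url: http://prometheus:9090    # in-cluster Prometheus service address
    isDefault: true
```

Provisioned data sources survive pod restarts, which matters here since Grafana is backed by an emptyDir volume.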
Creating Monitoring Dashboards
Basic node monitoring dashboard
The panels below query node_* metrics, which are exposed by node_exporter; make sure it (or an equivalent node metrics exporter) is running on your nodes.
# node-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-dashboard
  namespace: monitoring
data:
  node-dashboard.json: |
    {
      "dashboard": {
        "id": null,
        "title": "Node Monitoring",
        "timezone": "browser",
        "schemaVersion": 16,
        "version": 0,
        "refresh": "5s",
        "panels": [
          {
            "id": 1,
            "title": "CPU Usage",
            "type": "graph",
            "datasource": "Prometheus",
            "targets": [
              {
                "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
                "legendFormat": "{{instance}}"
              }
            ],
            "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
          },
          {
            "id": 2,
            "title": "Memory Usage",
            "type": "graph",
            "datasource": "Prometheus",
            "targets": [
              {
                "expr": "100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)",
                "legendFormat": "{{instance}}"
              }
            ],
            "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
          }
        ]
      }
    }
Deploy the dashboard configuration
kubectl apply -f node-dashboard.yaml

Note that Grafana does not pick up dashboards from a bare ConfigMap on its own; the ConfigMap must be wired in through Grafana's dashboard provisioning, or by a sidecar that watches for dashboard ConfigMaps. Alternatively, import the JSON manually through the Grafana UI.
Configuring Alerting Rules
Create the alerting rules file
# alert.rules
groups:
  - name: node-alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes"
      - alert: HighMemoryUsage
        expr: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High Memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for more than 5 minutes"
      - alert: NodeDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node is down"
          description: "Node {{ $labels.instance }} is down for more than 1 minute"
Deploy the alerting rules
# alert-rules-deployment.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alert-rules
  namespace: monitoring
data:
  alert.rules: |
    groups:
      - name: node-alerts
        rules:
          - alert: HighCPUUsage
            expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is above 80% for more than 5 minutes"
          - alert: HighMemoryUsage
            expr: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High Memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85% for more than 5 minutes"
          - alert: NodeDown
            expr: up == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Node is down"
              description: "Node {{ $labels.instance }} is down for more than 1 minute"

kubectl apply -f alert-rules-deployment.yaml

Note that the relative path in rule_files is resolved against Prometheus's config directory, so this ConfigMap only takes effect if the file actually lands in the pod at /etc/prometheus/alert.rules — for example by mounting it as an additional volume at that path, or by adding alert.rules as a second key in the prometheus-config ConfigMap.
Integrating Alertmanager
Alertmanager configuration
# alertmanager-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.com'
    route:
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: 'default-receiver'
    receivers:
      - name: 'default-receiver'
        email_configs:
          - to: 'admin@example.com'
            send_resolved: true
Deploy Alertmanager
# alertmanager-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:v0.24.0
          args:
            - '--config.file=/etc/alertmanager/alertmanager.yml'
            - '--storage.path=/alertmanager'
          ports:
            - containerPort: 9093
          volumeMounts:
            - name: config-volume
              mountPath: /etc/alertmanager
            - name: data
              mountPath: /alertmanager
          resources:
            requests:
              cpu: 50m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager-config
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  ports:
    - port: 9093
      targetPort: 9093
  type: ClusterIP
Point Prometheus at Alertmanager
# prometheus-with-alertmanager.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - "alert.rules"
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - "alertmanager:9093"
    scrape_configs:
      # ... (rest of the scrape configuration unchanged)
Example: Collecting Service Metrics
Create a sample application
The manifest below uses the prometheus.io/* markers that the kubernetes-pods scrape job filters on. Two fixes relative to a common mistake: these markers must live under metadata.annotations (the scrape job matches __meta_kubernetes_pod_annotation_* labels, not pod labels), and a stock nginx image listens on port 80. Note also that plain nginx does not expose a Prometheus /metrics endpoint by itself — in practice you would add an exporter sidecar or use an instrumented application image.
# sample-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
  namespace: monitoring
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "80"
    spec:
      containers:
        - name: sample-app
          image: nginx:1.21
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 200m
              memory: 256Mi
---
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  namespace: monitoring
spec:
  selector:
    app: sample-app
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP
Verify metric collection

# Check that Prometheus can reach and scrape its targets
kubectl port-forward svc/prometheus 9090:9090 -n monitoring

# Then query metrics in the Prometheus web UI, for example:
# node_cpu_seconds_total
# node_memory_MemAvailable_bytes
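Verification can also be scripted against the Prometheus HTTP API (GET /api/v1/query), for example through the port-forward above. A small stdlib-only sketch; the endpoint and the response shape ({"status": ..., "data": {"result": [...]}}) follow the documented Prometheus HTTP API:

```python
# Query the Prometheus HTTP API for an instant vector.
import json
import urllib.parse
import urllib.request

def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def instant_query(base_url: str, promql: str) -> list:
    """Run an instant query and return the result vector."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

# Example (requires the port-forward to be active):
# for series in instant_query("http://localhost:9090", "up"):
#     print(series["metric"].get("instance"), series["value"][1])
```

Querying `up` is a convenient smoke test: every configured target should appear with value 1.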
Advanced Monitoring Configuration
Collecting custom metrics
# custom-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-metrics-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    rule_files:
      - "alert.rules"
    scrape_configs:
      # Scrape a custom application
      - job_name: 'custom-app'
        static_configs:
          - targets: ['custom-app:8080']
        metrics_path: '/metrics'
        scrape_interval: 30s
Service mesh integration
For deployments that use a service mesh such as Istio, Prometheus can be configured to scrape the mesh's telemetry:
# istio-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: istio-metrics-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      # Scrape Istio telemetry pods
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: istio-telemetry
          # Rewrite the target address to the declared container port
          # (the pod-role meta label is __meta_kubernetes_pod_container_port_number)
          - source_labels: [__address__, __meta_kubernetes_pod_container_port_number]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
Performance Tuning Suggestions
Tuning Prometheus
# prometheus-optimized.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-optimized
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.37.0
          args:
            - '--config.file=/etc/prometheus/prometheus.yml'
            - '--storage.tsdb.path=/prometheus/'
            - '--storage.tsdb.retention.time=30d'
            - '--storage.tsdb.retention.size=50GB'
            # Query-load limits (Prometheus 2.x flag names)
            - '--query.max-concurrency=20'
            - '--storage.remote.read-concurrent-limit=10'
            - '--query.timeout=2m'
            - '--query.max-samples=100000000'
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config-volume
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 2Gi
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: data
          persistentVolumeClaim:
            claimName: prometheus-storage
Storage configuration
# prometheus-storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-storage
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
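When choosing the retention.size flag and the PVC size above, the Prometheus operational docs give a rule of thumb: needed disk ≈ retention time × ingested samples per second × bytes per sample, with roughly 1-2 bytes per compressed sample. A quick sketch of that arithmetic (the 2 bytes/sample figure is a conservative assumption, and real usage varies with series churn):

```python
# Back-of-the-envelope TSDB sizing:
#   needed_bytes ≈ retention_seconds * ingested_samples_per_second * bytes_per_sample
def estimate_tsdb_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    return retention_days * 86_400 * samples_per_second * bytes_per_sample

# 10k active series scraped every 15s ≈ 667 samples/s; 30 days of retention:
gib = estimate_tsdb_bytes(30, 10_000 / 15) / 1024**3
print(f"{gib:.1f} GiB")  # ≈ 3.2 GiB
```

Leave generous headroom: retention.size should stay well below the PVC size, since compaction temporarily needs extra space.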
Monitoring Best Practices
Metric design principles
- Meaningful metric names: use clear, descriptive names
- Sensible label usage: avoid excessive label cardinality
- Appropriate scrape frequency: set scrape intervals according to how critical each metric is

# Examples of metric design
# Good naming
http_requests_total{method="GET",endpoint="/api/users",status="200"}
node_cpu_seconds_total{mode="idle",instance="node1"}
# Naming to avoid
http_reqs_total{m="GET",e="/api/users",s="200"}
cpu_idle{node="node1"}
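Beyond readability, names must also be syntactically valid. The Prometheus data model restricts metric names to `[a-zA-Z_:][a-zA-Z0-9_:]*` and label names to `[a-zA-Z_][a-zA-Z0-9_]*`, with `__`-prefixed label names reserved for internal use. A small checker built directly from those documented regexes:

```python
# Validate metric and label names against the Prometheus data model rules.
import re

METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")
LABEL_NAME_RE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def is_valid_metric_name(name: str) -> bool:
    return bool(METRIC_NAME_RE.match(name))

def is_valid_label_name(name: str) -> bool:
    # Label names beginning with "__" are reserved for internal use.
    return bool(LABEL_NAME_RE.match(name)) and not name.startswith("__")

print(is_valid_metric_name("http_requests_total"))   # True
print(is_valid_metric_name("http-requests-total"))   # False: hyphens not allowed
```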
Alerting strategy
- Tiered alerts: define alerts at different severity levels
- Alert inhibition: suppress redundant, correlated alerts
- Alert rate control: group and throttle notifications to avoid alert storms
Monitoring coverage
# A broad scrape configuration
scrape_configs:
  # Base system metrics
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
  # Service metrics
  - job_name: 'kubernetes-service-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
  # Pod metrics
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
  # Custom application metrics
  - job_name: 'custom-applications'
    static_configs:
      - targets: ['app1:9090', 'app2:9090']
Troubleshooting Guide
Common problems
Metrics are not being collected:
- Check that the scrape annotations on the service or pod are correct
- Verify that the metrics port is open and reachable
- Check network policies
Alerts do not fire:
- Check the alerting rule expressions
- Verify that Prometheus has actually loaded the rules (see the Status → Rules page in the web UI)
