Building a Monitoring System for Spring Cloud Microservices: End-to-End Monitoring in Practice with Prometheus and Grafana
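Introduction
As enterprises push further into digital transformation, microservice architecture has become a mainstream choice for modern application development. Along with flexibility and scalability, however, it brings added complexity. Monitoring and managing the individual service components of a distributed system effectively is key to keeping the system stable and performant.
This article walks through building a complete monitoring system for Spring Cloud microservices on the Prometheus and Grafana stack, covering metric collection, distributed tracing, and visualization.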
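Core Challenges of Microservice Monitoring
Monitoring approaches designed for monolithic applications no longer suffice in a microservice architecture. The main challenges are:
- Distributed tracing is hard: call chains between services are complex, which makes root causes difficult to locate
- Metrics are scattered: each service runs independently, so monitoring data is spread across nodes
- Instances are dynamic: services scale in and out, so the monitoring configuration has to adapt automatically
- Alerting is complex: different services need different alerting policies
- Performance analysis is hard: without a global view, system-wide optimization is difficult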
Monitoring Architecture Design
Architecture Overview
The Prometheus and Grafana based monitoring system uses a layered architecture:
┌─────────────────────────────────────────────────────────────┐
│                Visualization layer (Grafana)                │
├─────────────────────────────────────────────────────────────┤
│                 Storage layer (Prometheus)                  │
├─────────────────────────────────────────────────────────────┤
│        Data collection layer (Prometheus exporters)         │
├─────────────────────────────────────────────────────────────┤
│        Microservice application layer (Spring Cloud)        │
└─────────────────────────────────────────────────────────────┘
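Core Components
- Prometheus: open-source monitoring and alerting toolkit that collects and stores metrics
- Grafana: open-source visualization platform with a rich set of chart types
- Spring Boot Actuator: exposes application health checks and metrics
- Micrometer: application metrics facade that provides a single instrumentation API
- OpenTelemetry: a vendor-neutral standard for distributed tracing
Environment Setup and Basic Configuration
Dependencies
First, add the required dependencies to the Spring Boot project: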
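<dependencies>
    <!-- Spring Boot Actuator -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Micrometer Prometheus Registry -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
    <!-- Spring Cloud LoadBalancer -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-loadbalancer</artifactId>
    </dependency>
    <!-- Consul discovery: required by the spring.cloud.consul settings in application.yml -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-consul-discovery</artifactId>
    </dependency>
    <!-- OpenTelemetry -->
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-api</artifactId>
        <version>1.24.0</version>
    </dependency>
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-sdk</artifactId>
        <version>1.24.0</version>
    </dependency>
    <!-- Provides OtlpGrpcSpanExporter, used in the tracing configuration -->
    <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-exporter-otlp</artifactId>
        <version>1.24.0</version>
    </dependency>
</dependencies>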
Application Configuration
Configure the Actuator and Prometheus settings in application.yml:
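server:
  port: 8080

management:
  endpoints:
    web:
      exposure:
        # "*" exposes every endpoint; in production, list only the ones needed,
        # for example "health,info,prometheus"
        include: "*"
  endpoint:
    health:
      show-details: always
    prometheus:
      enabled: true
  metrics:
    tags:
      application: ${spring.application.name}
    # Spring Boot 2.x property path; on Spring Boot 3.x the same settings
    # live under management.prometheus.metrics.export.*
    export:
      prometheus:
        enabled: true
        step: 1m
        descriptions: true

spring:
  application:
    name: user-service
  cloud:
    consul:
      host: localhost
      port: 8500
      discovery:
        service-name: ${spring.application.name}
        instance-id: ${spring.application.name}:${server.port}
        prefer-ip-address: true

logging:
  level:
    # DEBUG is verbose; intended for local troubleshooting only
    org.springframework.web: DEBUG
    io.micrometer: DEBUG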
Metric Collection and Exposure
Actuator Endpoint Configuration
Enable and configure the Actuator endpoints that expose monitoring metrics:
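@Configuration
public class ActuatorConfig implements WebMvcConfigurer {

    // @EnableWebMvc is omitted on purpose: it switches off Spring Boot's
    // MVC auto-configuration.
    // Actuator endpoints normally read their CORS settings from
    // management.endpoints.web.cors.*; if this mapping has no effect on
    // /actuator/**, configure those properties instead.
    @Override
    public void addCorsMappings(CorsRegistry registry) {
        registry.addMapping("/actuator/**")
                .allowedOrigins("*")
                .allowedMethods("GET", "POST")
                .allowedHeaders("*");
    }
}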
Custom Metric Collection
Create a custom metrics collector to track key business metrics:
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class BusinessMetricsCollector {

    private final MeterRegistry meterRegistry;

    public BusinessMetricsCollector(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;

        // Active-user gauge: Micrometer calls getActiveUserCount() on every scrape.
        // The "application" tag is already added to all meters through
        // management.metrics.tags, so it is not repeated per meter.
        Gauge.builder("user.active.count", this, BusinessMetricsCollector::getActiveUserCount)
                .description("Number of active users")
                .register(meterRegistry);
    }

    public void recordUserLogin(String userType) {
        // A registered Counter has no tag() method; tags are part of the meter's
        // identity. register() returns the existing counter for a known name/tag set.
        Counter.builder("user.login.count")
                .description("Number of user logins")
                .tag("user_type", userType)
                .register(meterRegistry)
                .increment();
    }

    public Timer.Sample startApiResponseTimer() {
        return Timer.start(meterRegistry);
    }

    public void recordApiResponseTime(Timer.Sample sample, String apiName, String statusCode) {
        // Same pattern as the counter: look up the tagged timer, then stop the sample on it
        sample.stop(Timer.builder("api.response.time")
                .description("API response time")
                .tag("api_name", apiName)
                .tag("status_code", statusCode)
                .register(meterRegistry));
    }

    private double getActiveUserCount() {
        // Placeholder: replace with the real business lookup
        return 1000.0;
    }
}
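The meter names above use Micrometer's dotted style, but Prometheus queries and alert rules see different names. The sketch below approximates that conversion so the names used in later PromQL are easier to predict. It is a simplified illustration, not Micrometer's code; `PrometheusNameSketch` and `MeterType` are hypothetical names, and only the basic rules are modelled (invalid characters become underscores, counters gain `_total`, timers gain `_seconds`).

```java
/** Simplified, hypothetical approximation of Micrometer's Prometheus naming rules. */
final class PrometheusNameSketch {

    enum MeterType { COUNTER, TIMER, GAUGE }

    static String toPrometheusName(String meterName, MeterType type) {
        // Characters outside [a-zA-Z0-9_:] (including the dots) become underscores
        String name = meterName.replaceAll("[^a-zA-Z0-9_:]", "_");
        switch (type) {
            case COUNTER:
                // Counters are exposed with a _total suffix
                return name.endsWith("_total") ? name : name + "_total";
            case TIMER:
                // Timers are exposed in seconds; the scrape output then carries
                // <name>_count, <name>_sum and <name>_max series
                return name.endsWith("_seconds") ? name : name + "_seconds";
            default:
                return name;
        }
    }

    public static void main(String[] args) {
        System.out.println(toPrometheusName("user.login.count", MeterType.COUNTER));  // user_login_count_total
        System.out.println(toPrometheusName("api.response.time", MeterType.TIMER));   // api_response_time_seconds
        System.out.println(toPrometheusName("user.active.count", MeterType.GAUGE));   // user_active_count
    }
}
```

With these rules, the login counter would be queried as `user_login_count_total`, and the timer as `api_response_time_seconds_count` and `api_response_time_seconds_sum`.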
Business Layer Integration
Integrate metric collection into the business layer:
import io.micrometer.core.instrument.Timer;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/users")
@Slf4j
public class UserController {

    private final BusinessMetricsCollector metricsCollector;
    private final UserService userService;

    public UserController(BusinessMetricsCollector metricsCollector,
                          UserService userService) {
        this.metricsCollector = metricsCollector;
        this.userService = userService;
    }

    @PostMapping("/login")
    public ResponseEntity<UserLoginResponse> login(@RequestBody UserLoginRequest request) {
        Timer.Sample sample = metricsCollector.startApiResponseTimer();
        try {
            UserLoginResponse response = userService.login(request);
            metricsCollector.recordUserLogin(request.getUserType());
            metricsCollector.recordApiResponseTime(sample, "user_login", "200");
            return ResponseEntity.ok(response);
        } catch (Exception e) {
            log.error("User login failed", e);
            // Every exception is recorded as a 500 here, whatever status is finally returned
            metricsCollector.recordApiResponseTime(sample, "user_login", "500");
            throw e;
        }
    }

    @GetMapping("/{id}")
    public ResponseEntity<User> getUserById(@PathVariable Long id) {
        Timer.Sample sample = metricsCollector.startApiResponseTimer();
        try {
            User user = userService.getUserById(id);
            metricsCollector.recordApiResponseTime(sample, "get_user", "200");
            return ResponseEntity.ok(user);
        } catch (Exception e) {
            log.error("Failed to fetch user", e);
            metricsCollector.recordApiResponseTime(sample, "get_user", "500");
            throw e;
        }
    }
}
Prometheus Configuration and Deployment
Prometheus Configuration File
Create the prometheus.yml configuration file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Spring Boot applications, discovered through Consul
  - job_name: 'spring-boot-apps'
    metrics_path: '/actuator/prometheus'
    consul_sd_configs:
      - server: 'localhost:8500'
        services: []
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job
      - source_labels: [__meta_consul_service_address, __meta_consul_service_port]
        separator: ':'
        regex: (.*)
        target_label: __address__
        replacement: ${1}
      - source_labels: [__meta_consul_service]
        target_label: service

  # Node Exporter (addressed by its Compose service name, like alertmanager below)
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
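The `__address__` relabel rule in this configuration joins the two Consul source labels with the separator, matches the joined string against the regex, and writes the expanded replacement into the target label. A minimal Java sketch of that one step follows. It is an illustration only, not Prometheus's implementation: `RelabelSketch` is a hypothetical name, and Prometheus's `${1}` is written `$1` in Java's replacement syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of a Prometheus "replace" relabel action. */
final class RelabelSketch {

    /**
     * Joins the source label values with the separator. If the regex matches
     * the whole joined string, the expanded replacement is stored in the
     * target label. Returns a new map; the input labels are left untouched.
     */
    static Map<String, String> replace(Map<String, String> labels,
                                       List<String> sourceLabels,
                                       String separator,
                                       String regex,
                                       String targetLabel,
                                       String replacement) {
        List<String> values = new ArrayList<>();
        for (String name : sourceLabels) {
            values.add(labels.getOrDefault(name, "")); // a missing label counts as empty
        }
        String joined = String.join(separator, values);

        Map<String, String> result = new LinkedHashMap<>(labels);
        // The regex is anchored at both ends, so it must match the full joined value
        Matcher m = Pattern.compile("^(?:" + regex + ")$").matcher(joined);
        if (m.matches()) {
            result.put(targetLabel, m.replaceFirst(replacement));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> discovered = Map.of(
                "__meta_consul_service_address", "10.0.0.5",
                "__meta_consul_service_port", "8080");
        Map<String, String> relabeled = replace(
                discovered,
                List.of("__meta_consul_service_address", "__meta_consul_service_port"),
                ":", "(.*)", "__address__", "$1");
        System.out.println(relabeled.get("__address__")); // 10.0.0.5:8080
    }
}
```

The resulting `__address__` is the host and port Prometheus scrapes, which is why the rule rebuilds it from the address and port that Consul reports for the service.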
Alert Rules Configuration
Create the alert rules file alert_rules.yml:
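groups:
  - name: spring-boot-alerts
    rules:
      # High CPU usage. Micrometer exposes process CPU as the gauge
      # process_cpu_usage (0 to 1); process_cpu_seconds_total is not exported
      # by Spring Boot applications.
      - alert: HighCPUUsage
        expr: process_cpu_usage > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage (instance {{ $labels.instance }})"
          description: "CPU usage on {{ $labels.instance }} is above 80%"

      # High heap usage, summed over the heap memory pools of each instance
      - alert: HighMemoryUsage
        expr: sum by (instance) (jvm_memory_used_bytes{area="heap"}) / sum by (instance) (jvm_memory_max_bytes{area="heap"}) > 0.8
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (instance {{ $labels.instance }})"
          description: "JVM heap usage on {{ $labels.instance }} is above 80%"

      # Service unavailable
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service unavailable (instance {{ $labels.instance }})"
          description: "{{ $labels.instance }} is not reachable"

      # High error rate: share of 5xx responses among all requests
      - alert: HighErrorRate
        expr: sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by (instance) (rate(http_server_requests_seconds_count[5m])) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate (instance {{ $labels.instance }})"
          description: "More than 5% of HTTP requests on {{ $labels.instance }} return 5xx"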
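The `HighErrorRate` rule is described as an error rate above 5%. That is a ratio of 5xx requests to all requests, which is not the same quantity as the number of 5xx requests per second. The worked example below shows the difference with plain arithmetic. It is a sketch under stated assumptions: the counter values are invented for illustration, `ErrorRatioSketch` is a hypothetical name, and counter resets are ignored (PromQL's `rate()` handles them).

```java
/** Hypothetical worked example: errors per second versus error ratio. */
final class ErrorRatioSketch {

    /** Per-second increase of a counter between two samples (no reset handling). */
    static double ratePerSecond(double earlier, double later, double windowSeconds) {
        return (later - earlier) / windowSeconds;
    }

    /** Share of failed requests among all requests; 0 when there is no traffic. */
    static double errorRatio(double errorRate, double totalRate) {
        return totalRate == 0.0 ? 0.0 : errorRate / totalRate;
    }

    public static void main(String[] args) {
        double window = 300.0; // a 5 minute window, as in [5m]

        // Invented counter samples at the start and end of the window
        double errorsPerSecond = ratePerSecond(50, 80, window);          // 30 errors in 300 s
        double requestsPerSecond = ratePerSecond(10_000, 13_000, window); // 3000 requests in 300 s
        double ratio = errorRatio(errorsPerSecond, requestsPerSecond);

        System.out.printf("errors/s=%.2f requests/s=%.2f ratio=%.2f%%%n",
                errorsPerSecond, requestsPerSecond, ratio * 100);
    }
}
```

In this example the service produces about 0.1 errors per second, which is above 0.05, while the share of failed requests is only about 1%. A threshold of 0.05 therefore means "5%" only when it is applied to the ratio.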
Docker Compose Deployment
Create a docker-compose.yml file for containerized deployment:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.43.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  grafana:
    image: grafana/grafana-enterprise:9.4.7
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      # Example password only; change it before exposing Grafana
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.5.0
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
Distributed Tracing
OpenTelemetry Integration
Configure OpenTelemetry for distributed tracing:
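import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.baggage.propagation.W3CBaggagePropagator;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class OpenTelemetryConfig {

    @Bean
    public OpenTelemetry openTelemetry() {
        // "service.name" is written as a plain attribute key so that the
        // separate opentelemetry-semconv artifact (ResourceAttributes) is not needed
        Resource resource = Resource.getDefault()
                .merge(Resource.create(Attributes.of(
                        AttributeKey.stringKey("service.name"), "user-service")));

        // Spans are batched and sent over OTLP/gRPC to a collector on localhost:4317
        SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(OtlpGrpcSpanExporter.builder()
                        .setEndpoint("http://localhost:4317")
                        .build()).build())
                .setResource(resource)
                .build();

        // W3C trace context and baggage headers carry the trace across services
        return OpenTelemetrySdk.builder()
                .setTracerProvider(sdkTracerProvider)
                .setPropagators(ContextPropagators.create(
                        TextMapPropagator.composite(
                                W3CTraceContextPropagator.getInstance(),
                                W3CBaggagePropagator.getInstance())))
                .buildAndRegisterGlobal();
    }

    @Bean
    public Tracer tracer(OpenTelemetry openTelemetry) {
        return openTelemetry.getTracer("user-service");
    }
}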
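The `W3CTraceContextPropagator` configured above carries the trace between services in a `traceparent` HTTP header of the form `version-traceid-spanid-flags`. The sketch below parses and formats that header with the JDK only, to show what actually travels on the wire. It is not the OpenTelemetry implementation: `TraceParentSketch` is a hypothetical name, only version `00` is accepted, and the sample header is the example value from the W3C Trace Context specification.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified parser for the W3C "traceparent" header (version 00 only). */
final class TraceParentSketch {

    // 00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String spanId;
    final boolean sampled;

    private TraceParentSketch(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    static Optional<TraceParentSketch> parse(String header) {
        if (header == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // All-zero ids are defined as invalid
        if (isAllZero(traceId) || isAllZero(spanId)) {
            return Optional.empty();
        }
        // The lowest bit of the flags byte is the "sampled" flag
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 1;
        return Optional.of(new TraceParentSketch(traceId, spanId, sampled));
    }

    /** Formats the header a service would send on its outgoing calls. */
    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    private static boolean isAllZero(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }

    public static void main(String[] args) {
        String incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        parse(incoming).ifPresent(tp ->
                System.out.println(tp.traceId + " sampled=" + tp.sampled));
    }
}
```

A downstream service keeps the trace id and replaces the span id with its own before calling the next service, which is how spans from different services end up in one trace.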
Tracing Interceptor
Create an interceptor that opens a span for each request:
@Component
@Slf4j
public class TracingInterceptor implements HandlerInterceptor {

    private final Tracer tracer;

    public TracingInterceptor(Tracer tracer) {
        this.tracer = tracer;
    }

    @Override
    public boolean preHandle(HttpServletRequest request,
                             HttpServletResponse response,
                             Object handler) throws Exception {
        Span span = tracer.spanBuilder(request.getRequestURI())
                .setAttribute("http.method", request.getMethod())
                .setAttribute("http.url", request.getRequestURL().toString())
                .setAttribute("http.user_agent", request.getHeader("User-Agent"))
                .startSpan();
        Context currentContext = Context.current().with(span);
        Scope scope = currentContext.makeCurrent();
        request.setAttribute("span", span);
        request.setAttribute("scope", scope);
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest request,
                                HttpServletResponse response,
                                Object handler, Exception ex) throws Exception {
        Span span = (Span) request.getAttribute("span");
        Scope scope = (Scope) request.getAttribute("scope");
        if (span != null) {
            span.setAttribute("http.status_code", response.getStatus());
            if (ex != null) {
                span.recordException(ex);
                span.setStatus(StatusCode.ERROR, ex.getMessage());
            }
            span.end();
        }
        if (scope != null) {
            scope.close();
        }
    }
}
WebMvc Configuration
Register the interceptor:
@Configuration
public class WebMvcConfig implements WebMvcConfigurer {

    private final TracingInterceptor tracingInterceptor;

    public WebMvcConfig(TracingInterceptor tracingInterceptor) {
        this.tracingInterceptor = tracingInterceptor;
    }

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        registry.addInterceptor(tracingInterceptor)
                .addPathPatterns("/api/**");
    }
}
Grafana Visualization
Data Source Configuration
Add the Prometheus data source through the Grafana UI or a provisioning file:
# grafana/provisioning/datasources/datasources.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Fixed uid so that provisioned alert rules can refer to it as datasourceUid
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
Dashboard Configuration
Create a custom dashboard as JSON:
{
  "dashboard": {
    "id": null,
    "title": "Spring Boot Application Dashboard",
    "tags": ["spring-boot", "microservice"],
    "timezone": "browser",
    "panels": [
      {
        "type": "graph",
        "title": "JVM Memory Usage",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "jvm_memory_used_bytes{application=\"$application\", instance=\"$instance\"}",
            "legendFormat": "{{area}} - {{id}}"
          }
        ],
        "datasource": "Prometheus"
      },
      {
        "type": "graph",
        "title": "HTTP Request Rate",
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "rate(http_server_requests_seconds_count[1m])",
            "legendFormat": "{{method}} {{uri}} {{status}}"
          }
        ],
        "datasource": "Prometheus"
      }
    ],
    "templating": {
      "list": [
        {
          "name": "application",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(application)"
        },
        {
          "name": "instance",
          "type": "query",
          "datasource": "Prometheus",
          "query": "label_values(jvm_memory_used_bytes{application=\"$application\"}, instance)"
        }
      ]
    }
  }
}
Alert Configuration
Configure alert rules in Grafana:
# grafana/provisioning/alerting/alerts.yml
apiVersion: 1
groups:
  - orgId: 1
    name: spring-boot-alerts
    folder: Alerts
    interval: 60s
    rules:
      - uid: high-cpu-usage
        title: High CPU Usage Alert
        condition: B
        data:
          # Query A returns the raw CPU gauge; the 0.8 threshold is applied in B,
          # so the rule still has data while usage is below the threshold
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: prometheus
            model:
              expr: process_cpu_usage
              intervalMs: 1000
              maxDataPoints: 43200
              refId: A
          - refId: B
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: __expr__
            model:
              conditions:
                - evaluator:
                    params:
                      - 0.8
                    type: gt
                  operator:
                    type: and
                  query:
                    params:
                      - A
                  reducer:
                    params: []
                    type: last
                  type: query
              datasource:
                type: __expr__
                uid: __expr__
              expression: A
              intervalMs: 1000
              maxDataPoints: 43200
              refId: B
              type: classic_conditions
        noDataState: NoData
        execErrState: Error
        for: 2m
        annotations:
          summary: High CPU usage detected
        labels:
          severity: warning
Alert Notification Configuration
Alertmanager Configuration
Create the alertmanager.yml configuration file:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
    - match:
        severity: warning
      receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'ops-team@example.com'
        send_resolved: true
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_ROUTING_KEY'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
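The inhibit rule above mutes a `warning` alert while a `critical` alert is firing, but only when both alerts carry the same values for every label listed under `equal`. The sketch below models that check with plain maps to make the effect of the `equal` list visible. It is an illustration rather than Alertmanager's code: `InhibitSketch` is a hypothetical name, the severities are hard-coded to match this rule, and a missing label is treated like an empty one.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model of the critical-inhibits-warning rule shown above. */
final class InhibitSketch {

    /**
     * Returns true when the target alert is a warning and some firing alert is
     * critical with identical values for every label in the equal list.
     */
    static boolean isInhibited(Map<String, String> target,
                               List<Map<String, String>> firing,
                               List<String> equal) {
        if (!"warning".equals(target.get("severity"))) {
            return false; // the rule only targets warnings
        }
        for (Map<String, String> source : firing) {
            if (!"critical".equals(source.get("severity"))) {
                continue; // only critical alerts act as a source
            }
            boolean allEqual = true;
            for (String label : equal) {
                // A label that is absent on both sides compares as equal
                if (!source.getOrDefault(label, "").equals(target.getOrDefault(label, ""))) {
                    allEqual = false;
                    break;
                }
            }
            if (allEqual) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> equal = List.of("alertname", "dev", "instance");
        Map<String, String> down = Map.of(
                "alertname", "ServiceDown", "severity", "critical", "instance", "10.0.0.5:8080");
        Map<String, String> cpu = Map.of(
                "alertname", "HighCPUUsage", "severity", "warning", "instance", "10.0.0.5:8080");

        // Different alertname, so the CPU warning is not muted by ServiceDown
        System.out.println(isInhibited(cpu, List.of(down), equal)); // false
    }
}
```

Because `alertname` is part of `equal`, a critical `ServiceDown` alert does not suppress a warning with a different name on the same instance. If that suppression is wanted, one option is to compare on `instance` only.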
Webhook Integration
Create a custom webhook handler:
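import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/webhook")
@Slf4j
public class AlertWebhookController {

    // ObjectMapper is thread-safe once configured, so one instance is reused
    private final ObjectMapper mapper = new ObjectMapper();

    @PostMapping("/alert")
    public ResponseEntity<String> handleAlert(@RequestBody String alertData) {
        log.info("Alert notification received: {}", alertData);
        // Handle the alert
        processAlert(alertData);
        return ResponseEntity.ok("Alert received");
    }

    private void processAlert(String alertData) {
        // Parse the alert payload
        try {
            JsonNode alertNode = mapper.readTree(alertData);
            // Forward to WeCom, DingTalk, and similar channels
            sendNotification(alertNode);
        } catch (Exception e) {
            log.error("Failed to process alert", e);
        }
    }

    private void sendNotification(JsonNode alertNode) {
        // Implement the notification logic here,
        // for example WeCom, DingTalk, or Slack
    }
}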
Performance Optimization and Best Practices
Tuning Metric Sampling
Configure which percentiles are published for HTTP metrics:
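import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;
// Package used by micrometer-registry-prometheus before Micrometer 1.13
import io.micrometer.prometheus.PrometheusMeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MicrometerConfig {

    @Bean
    public MeterRegistryCustomizer<PrometheusMeterRegistry> prometheusMeterRegistryCustomizer() {
        return registry -> {
            // Publish client-side percentiles for every meter whose name starts with "http"
            registry.config()
                    .meterFilter(new MeterFilter() {
                        @Override
                        public DistributionStatisticConfig configure(Meter.Id id,
                                                                     DistributionStatisticConfig config) {
                            if (id.getName().startsWith("http")) {
                                return DistributionStatisticConfig.builder()
                                        .percentiles(0.5, 0.9, 0.95, 0.99)
                                        .build()
                                        .merge(config);
                            }
                            return config;
                        }
                    });
        };
    }
}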
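The filter above publishes the 50th, 90th, 95th, and 99th percentiles. The sketch below computes those four values for a small set of latencies with the nearest-rank method, to show what they say that an average hides. It is only an illustration: Micrometer estimates percentiles with its own streaming approximation, the latency values are invented, and `PercentileSketch` is a hypothetical name.

```java
import java.util.Arrays;

/** Hypothetical nearest-rank percentile calculation over a fixed sample set. */
final class PercentileSketch {

    /** Returns the smallest sample such that at least percent% of samples are <= it. */
    static double percentile(double[] samples, int percent) {
        if (samples.length == 0 || percent < 1 || percent > 100) {
            throw new IllegalArgumentException("need samples and a percent in 1..100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        // Integer form of ceil(percent / 100 * n), which avoids floating-point rounding
        int rank = (percent * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Invented request latencies in milliseconds; two slow outliers
        double[] latencies = {12, 15, 11, 250, 14, 13, 16, 18, 17, 900};

        System.out.println(percentile(latencies, 50)); // 15.0
        System.out.println(percentile(latencies, 90)); // 250.0
        System.out.println(percentile(latencies, 95)); // 900.0
        System.out.println(percentile(latencies, 99)); // 900.0
    }
}
```

The mean of these samples is 126.6 ms, which describes neither the typical request (15 ms) nor the slow tail (900 ms). Percentiles computed inside each instance cannot be meaningfully averaged across instances; when a fleet-wide quantile is needed, publishing histogram buckets and using PromQL's `histogram_quantile` is the usual alternative.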
Tag Optimization
Choose tags carefully to avoid high-cardinality problems:
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.concurrent.TimeUnit;
import org.springframework.stereotype.Component;

@Component
public class MetricsTaggingService {

    private final MeterRegistry meterRegistry;

    public MetricsTaggingService(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    public void recordApiCall(String apiName, String method, int statusCode, long duration) {
        // Avoid high-cardinality tags such as user IDs or request IDs
        Timer.builder("api.call.duration")
                .tag("api_name", normalizeApiName(apiName))
                .tag("method", method)
                // Use the status class (2xx, 4xx, 5xx) rather than the exact code
                .tag("status", (statusCode / 100) + "xx")
                .register(meterRegistry)
                .record(duration, TimeUnit.MILLISECONDS);
    }

    private String normalizeApiName(String apiName) {
        // Collapse numeric path segments so that each route yields one tag value
        return apiName.replaceAll("/\\d+", "/{id}");
    }
}
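The path normalization above is what keeps the `api_name` tag from growing with every user or order ID. The sketch below runs the same idea on a few sample paths with the JDK only. `TagNormalizerSketch` is a hypothetical name, and the regex is a slight variation on the one above: it adds a lookahead so that only whole numeric segments are replaced, which is an assumption about the intended behavior rather than something the original code does.

```java
/** Hypothetical helper showing how path and status tags can be kept low-cardinality. */
final class TagNormalizerSketch {

    /** Replaces purely numeric path segments with the placeholder {id}. */
    static String normalizePath(String path) {
        // (?=/|$) requires the digits to be followed by "/" or the end of the path,
        // so a segment such as "/2fa" is left alone
        return path.replaceAll("/\\d+(?=/|$)", "/{id}");
    }

    /** Maps an HTTP status code to its class, for example 404 to "4xx". */
    static String statusClass(int statusCode) {
        return (statusCode / 100) + "xx";
    }

    public static void main(String[] args) {
        System.out.println(normalizePath("/api/users/123"));            // /api/users/{id}
        System.out.println(normalizePath("/api/users/123/orders/456")); // /api/users/{id}/orders/{id}
        System.out.println(normalizePath("/api/2fa/verify"));           // /api/2fa/verify
        System.out.println(statusClass(404));                           // 4xx
    }
}
```

Without the lookahead, the pattern `/\d+` would turn `/api/2fa/verify` into `/api/{id}fa/verify`. Non-numeric identifiers such as UUIDs are not covered by either version and would need their own pattern.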
Resource Monitoring
Monitor system resource usage:
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.io.File;
import org.springframework.stereotype.Component;

@Component
public class SystemMetricsCollector {

    // Kept as a field: gauges hold only a weak reference to the object they observe
    private final File root = new File(".");

    public SystemMetricsCollector(MeterRegistry meterRegistry) {
        // JVM memory, GC, thread, class-loader and processor metrics are already
        // bound by Spring Boot Actuator's auto-configuration, so they are not
        // bound a second time here.

        // Gauges are registered once. Micrometer calls the function on every
        // scrape, so no @Scheduled refresh is needed.
        Gauge.builder("disk.used.bytes", root, SystemMetricsCollector::usedBytes)
                .description("Used space on the partition that holds the working directory")
                .register(meterRegistry);
        Gauge.builder("disk.utilization.percent", root, SystemMetricsCollector::utilizationPercent)
                .description("Used space as a percentage of total space")
                .register(meterRegistry);
    }

    private static double usedBytes(File f) {
        return f.getTotalSpace() - f.getFreeSpace();
    }

    private static double utilizationPercent(File f) {
        long total = f.getTotalSpace();
        // getTotalSpace() returns 0 when the path does not name a partition
        return total == 0 ? 0.0 : (total - f.getFreeSpace()) * 100.0 / total;
    }
}
Security Configuration
Access Control
Restrict access to the Actuator endpoints:
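import static org.springframework.security.config.Customizer.withDefaults;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
        http
                // Matchers are evaluated in order, so the specific actuator paths
                // must come before the /actuator/** catch-all
                .authorizeHttpRequests(authz -> authz
                        .requestMatchers("/actuator/health", "/actuator/info").permitAll()
                        .requestMatchers("/actuator/prometheus").hasRole("MONITORING")
                        .requestMatchers("/actuator/**").hasRole("ADMIN")
                        .anyRequest().authenticated()
                )
                .httpBasic(withDefaults())
                .csrf(csrf -> csrf.ignoringRequestMatchers("/actuator/**"));
        return http.build();
    }
}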
TLS Configuration
Enable HTTPS for encrypted transport:
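server:
  ssl:
    enabled: true
    key-store: classpath:keystore.p12
    # Example value; load the real password from a secret store or environment variable
    key-store-password: password
    key-store-type: PKCS12
    key-alias: tomcat

management:
  server:
    # management.server.ssl only applies when management.server.port is set
    # to a port different from server.port
    ssl:
      enabled: true
      key-store: classpath:keystore.p12
      key-store-password: password
      key-store-type: PKCS12
      key-alias: management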