[Spring Cloud: From Beginner to Architect] Chapter 13: Service Monitoring and Alerting

张建站

2026/5/20 8:29:09

10 min read

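1. Configuring Spring Boot Actuator monitoring endpoints

Spring Boot Actuator provides a rich set of monitoring endpoints. The configuration is described in detail below.

Basic configuration

1.1 Add the dependencies

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Needed if health-check details are required -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

1.2 Basic settings

    management:
      endpoints:
        web:
          exposure:
            include: "*"          # expose all endpoints (the asterisk must be quoted in YAML)
            # or list specific endpoints
            # include: health,info,metrics,prometheus
          base-path: /actuator    # base path for the endpoints
          path-mapping:
            health: healthcheck   # custom endpoint path
        jmx:
          exposure:
            include: "*"
      endpoint:
        health:
          show-details: always    # show health details
          show-components: always
          probes:
            enabled: true         # enable readiness/liveness probes
        shutdown:
          enabled: true           # enable the shutdown endpoint (use with caution)

Common endpoint configuration

2.1 Health endpoint

    management:
      endpoint:
        health:
          show-details: always
          show-components: always
          group:
            readiness:
              include: db,redis,diskSpace
            liveness:
              include: ping
          status:
            order: DOWN, OUT_OF_SERVICE, UP, UNKNOWN
            http-mapping:
              DOWN: 503
              OUT_OF_SERVICE: 503

Custom health indicator

    import java.time.Instant;
    import org.springframework.boot.actuate.health.Health;
    import org.springframework.boot.actuate.health.HealthIndicator;
    import org.springframework.stereotype.Component;

    @Component
    public class CustomHealthIndicator implements HealthIndicator {

        @Override
        public Health health() {
            boolean isHealthy = checkService();
            if (isHealthy) {
                return Health.up()
                        .withDetail("service", "available")
                        .withDetail("timestamp", Instant.now())
                        .build();
            }
            return Health.down()
                    .withDetail("service", "unavailable")
                    .build();
        }

        // Placeholder: replace with a real check of the dependent service.
        private boolean checkService() {
            return true;
        }
    }

2.2 Metrics endpoint

    # These property names follow Spring Boot 2.x. In Spring Boot 3.x the
    # Prometheus export settings moved to management.prometheus.metrics.export.*
    management:
      metrics:
        export:
          prometheus:
            enabled: true
            step: 1m
        web:
          server:
            request:
              autotime:
                enabled: true
        enable:
          jvm: true
          system: true
          logback: true
          process: true
        distribution:
          percentiles-histogram:
            http.server.requests: true
          sla:                    # renamed to "slo" from Spring Boot 2.3 onward
            http.server.requests: 100ms, 200ms, 500ms

Custom metrics

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import org.springframework.stereotype.Service;

    @Service
    public class OrderService {

        private final MeterRegistry meterRegistry;
        private final Counter orderCounter;

        public OrderService(MeterRegistry meterRegistry) {
            this.meterRegistry = meterRegistry;
            this.orderCounter = Counter.builder("orders.total")
                    .description("Total number of orders")
                    .register(meterRegistry);
        }

        public void createOrder() {
            orderCounter.increment();
            // register() returns the existing timer if one with this name is already registered
            Timer timer = Timer.builder("order.processing.time")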
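                    .register(meterRegistry);
            timer.record(() -> {
                // process the order
            });
        }
    }

2.3 Info endpoint

    management:
      info:
        env:
          enabled: true
        java:
          enabled: true
        os:
          enabled: true

    info:
      app:
        name: My Application
        version: 1.0.0
        description: Spring Boot Application
      contact:
        email: admin@example.com
      custom:
        deploy-time: ${DEPLOY_TIME}

Adding information programmatically

    import java.time.Instant;
    import java.util.HashMap;
    import java.util.Map;
    import org.springframework.boot.actuate.info.Info;
    import org.springframework.boot.actuate.info.InfoContributor;
    import org.springframework.stereotype.Component;

    @Component
    public class BuildInfoContributor implements InfoContributor {

        @Override
        public void contribute(Info.Builder builder) {
            Map<String, String> buildInfo = new HashMap<>();
            buildInfo.put("build.version", "2.5.0");
            buildInfo.put("build.time", Instant.now().toString());
            builder.withDetail("build", buildInfo);
        }
    }

Security configuration

3.1 Access control

    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.security.config.Customizer;
    import org.springframework.security.config.annotation.web.builders.HttpSecurity;
    import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
    import org.springframework.security.web.SecurityFilterChain;

    @Configuration
    @EnableWebSecurity
    public class ActuatorSecurityConfig {

        @Bean
        public SecurityFilterChain actuatorSecurityFilterChain(HttpSecurity http) throws Exception {
            http
                .securityMatcher("/actuator/**")
                .authorizeHttpRequests(authz -> authz
                    // health and info are public; everything else requires the ADMIN role
                    .requestMatchers("/actuator/health", "/actuator/info").permitAll()
                    .requestMatchers("/actuator/**").hasRole("ADMIN")
                )
                .httpBasic(Customizer.withDefaults())
                .csrf(csrf -> csrf.disable());
            return http.build();
        }
    }

3.2 Using the Spring Security default user

    spring:
      security:
        user:
          name: admin
          password: "{bcrypt}$2a$10$..."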
          roles: ADMIN

Advanced configuration

4.1 Custom endpoint

    import java.util.HashMap;
    import java.util.Map;
    import org.springframework.boot.actuate.endpoint.annotation.Endpoint;
    import org.springframework.boot.actuate.endpoint.annotation.ReadOperation;
    import org.springframework.boot.actuate.endpoint.annotation.Selector;
    import org.springframework.boot.actuate.endpoint.annotation.WriteOperation;
    import org.springframework.stereotype.Component;

    @Component
    @Endpoint(id = "custom")
    public class CustomEndpoint {

        @ReadOperation
        public Map<String, Object> getInfo() {
            Map<String, Object> info = new HashMap<>();
            info.put("status", "UP");
            info.put("timestamp", System.currentTimeMillis());
            return info;
        }

        @WriteOperation
        public String executeOperation(@Selector String operation) {
            return "Executed: " + operation;
        }
    }

4.2 Enabling specific endpoints

    management:
      endpoint:
        beans:
          enabled: true
        conditions:
          enabled: true
        configprops:
          enabled: true
        env:
          enabled: true
        heapdump:
          enabled: true
        httptrace:            # replaced by "httpexchanges" in Spring Boot 3.x
          enabled: true
        loggers:
          enabled: true
        mappings:
          enabled: true
        prometheus:
          enabled: true
        scheduledtasks:
          enabled: true
        sessions:
          enabled: true
        threaddump:
          enabled: true

4.3 Prometheus integration

    management:
      metrics:
        export:
          prometheus:
            enabled: true
            descriptions: true
            step: 30s
          simple:
            enabled: false
        web:
          server:
            request:
              autotime:
                enabled: true
        tags:
          application: ${spring.application.name}
          environment: ${ENV:dev}

Production best practices

5.1 Sample production configuration

    management:
      endpoints:
        web:
          exposure:
            include: health,info,metrics,prometheus
          base-path: /internal/actuator
          discovery:
            enabled: false
      endpoint:
        health:
          show-details: when-authorized
          show-components: when-authorized
          probes:
            enabled: true
          group:
            readiness:
              include: "*"
            liveness:
              include: ping
        info:
          enabled: true
        metrics:
          enabled: true
        prometheus:
          enabled: true
      metrics:
        export:
          prometheus:
            enabled: true
            step: 30s
        enable:
          jvm: true
          system: true
          process: true
        distribution:
          percentiles-histogram:
            http.server.requests: true
      server:
        port: 8081            # separate from the business port
        address: 127.0.0.1    # local access only
      health:
        redis:
          enabled: true
        db:
          enabled: true
        diskspace:
          enabled: true
          threshold: 10MB

5.2 Kubernetes health checks

    # application-k8s.yaml
    management:
      endpoint:
        health:
          probes:
            enabled: true
          show-details: never
      endpoints:
        web:
          exposure:
            include: health
          base-path: /
      health:
        livenessstate:
          enabled: true
        readinessstate:
          enabled: true

Kubernetes configuration

    # deployment.yaml
    livenessProbe:
      httpGet:
        # base-path is "/" in application-k8s.yaml, so there is no /actuator prefix
        path: /health/liveness
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
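    readinessProbe:
      httpGet:
        # base-path is "/" in application-k8s.yaml, so there is no /actuator prefix
        path: /health/readiness
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 5

Common endpoints

Endpoint    Path                  Description                   Production advice
health      /actuator/health      Application health status     ✅ Expose
info        /actuator/info        Application information       ✅ Expose
metrics     /actuator/metrics     Application metrics           ✅ Expose
prometheus  /actuator/prometheus  Metrics in Prometheus format  ✅ Expose
loggers     /actuator/loggers     View/change log levels        ⚠️ Restricted access
env         /actuator/env         Environment variables         ❌ Do not expose
beans       /actuator/beans       Spring beans                  ❌ Do not expose
mappings    /actuator/mappings    Request mappings              ❌ Do not expose
shutdown    /actuator/shutdown    Shut down the application     ❌ Do not expose

Notes

- Security first: do not expose sensitive endpoints in production.
- Least privilege: expose endpoints only as needed.
- Monitoring and alerting: integrate with a monitoring system (Prometheus + Grafana).
- Performance impact: keep an eye on the overhead of collecting monitoring data.
- Version compatibility: endpoints and property names can differ between Spring Boot versions.

With the basic configuration above, the monitoring endpoints are reachable under http://localhost:8080/actuator.

2. Prometheus + Grafana: metrics monitoring and dashboard setup

I. Core components

1. Prometheus: metric collection and storage

- Time-series database built for storing monitoring metrics
- Collects data in pull mode
- Supports service discovery
- Built-in alerting rule engine

2. Grafana: data visualization

- Rich set of chart types
- Supports many data sources
- Customizable dashboards
- Alert notifications

II. Installation and deployment

Docker Compose (recommended)

    version: "3.8"

    services:
      prometheus:
        image: prom/prometheus:latest
        container_name: prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus_data:/prometheus
        command:
          - --config.file=/etc/prometheus/prometheus.yml
          - --storage.tsdb.path=/prometheus
          - --web.console.libraries=/etc/prometheus/console_libraries
          - --web.console.templates=/etc/prometheus/consoles
          - --storage.tsdb.retention.time=30d
        restart: unless-stopped

      grafana:
        image: grafana/grafana:latest
        container_name: grafana
        ports:
          - "3000:3000"
        volumes:
          - grafana_data:/var/lib/grafana
          - ./grafana/provisioning:/etc/grafana/provisioning
        environment:
          - GF_SECURITY_ADMIN_PASSWORD=admin123
        restart: unless-stopped

    volumes:
      prometheus_data:
      grafana_data:

Prometheus configuration file

prometheus/prometheus.yml:

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    alerting:
      alertmanagers:
        - static_configs:
            - targets: []

    rule_files: []

    scrape_configs:
      - job_name: prometheus
        static_configs:
          - targets: ["localhost:9090"]
      - job_name: node_exporter
        static_configs:
          - targets: ["node-exporter:9100"]
      - job_name: docker
        static_configs:
          - targets: ["docker-exporter:9323"]

III. Installing common exporters

1. Node Exporter: host monitoring

    docker run -d \
      --name=node-exporter \
      -p 9100:9100 \
      -v /proc:/host/proc:ro \
      -v /sys:/host/sys:ro \
      -v /:/rootfs:ro \
      prom/node-exporter

2. cAdvisor: container monitoring

    docker run \
      --volume=/:/rootfs:ro \
      --volume=/var/run:/var/run:ro \
      --volume=/sys:/sys:ro \
      --volume=/var/lib/docker/:/var/lib/docker:ro \
      --volume=/dev/disk/:/dev/disk:ro \
      --publish=8080:8080 \
      --detach=true \
      --name=cadvisor \
      gcr.io/cadvisor/cadvisor:latest

3. MySQL Exporter

    # Some newer exporter releases read credentials from a my.cnf-style file
    # instead of DATA_SOURCE_NAME; check the README for the version you run.
    docker run -d \
      --name=mysql-exporter \
      -p 9104:9104 \
      -e DATA_SOURCE_NAME="user:password@(hostname:3306)/" \
      prom/mysqld-exporter

IV. Grafana configuration

Add a data source

1. Open http://localhost:3000 and log in. The default is admin/admin; with the Compose file above the admin password is admin123.
2. Go to Configuration → Data Sources → Add data source.
3. Select Prometheus.
4. Enter the URL: http://prometheus:9090

Import dashboards

Common dashboard IDs:

- Node Exporter: 1860
- Docker: 893
- MySQL: 7362
- Redis: 763

Import steps:

1. Go to Dashboards → Import.
2. Enter the dashboard ID or paste the JSON.
3. Select the Prometheus data source.

VI. Alerting configuration

Alertmanager configuration

    global:
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'alert@example.com'
      smtp_auth_username: 'user@gmail.com'
      smtp_auth_password: 'password'

    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 1h
      receiver: 'email-notifications'

    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'admin@example.com'
            send_resolved: true

Prometheus alerting rules

    groups:
      - name: example
        rules:
          - alert: HighMemoryUsage
            expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is above 85% (current value: {{ $value }}%)"

VII. Optimization tips

Performance tuning

    # prometheus.yml tuning, as given in the source. In many Prometheus releases
    # retention and block durations are set through command-line flags rather
    # than prometheus.yml, so verify these keys against your version.
    storage:
      tsdb:
        retention: 30d
        max-block-duration: 2h
        min-block-duration: 2h

    # Startup flags
    --storage.tsdb.retention.time=30d
    --storage.tsdb.retention.size=50GB
    --query.max-concurrency=20
    --query.timeout=2m

High availability

- Use Thanos or Cortex for long-term storage and high availability
- Run multiple Prometheus instances behind a load balancer
- Replicate across regions

VIII. Example dashboards

System resources

- CPU usage
- Memory usage
- Disk I/O
- Network traffic
- System load

Business monitoring

- Request QPS
- Response time
- Error rate
- Business metrics (order count, user count, and so on)

Database monitoring

- Connection count
- Query performance
- Slow queries
- Lock waits

IX. Common commands

    # Check the configuration file
    promtool check config prometheus.yml

    # Test alerting rules
    promtool test rules test.yml

    # Query a metric
    curl \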
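      'http://localhost:9090/api/v1/query?query=up'

    # Check target status
    curl http://localhost:9090/api/v1/targets

X. Troubleshooting

Prometheus cannot scrape data

- Check network connectivity
- Verify that the exporter is running
- Check firewall rules

Grafana shows no data

- Confirm that the data source connection works
- Check the PromQL syntax
- Check the selected time range

Disk space is running out

- Adjust the data retention policy
- Clean up old data
- Add storage capacity

This monitoring stack covers most scenarios. Depending on the business, more exporters and custom metrics can be added to build a complete observability platform.

3. JVM monitoring and API QPS, error rate, and response time monitoring

I. Core JVM metrics

1. Memory

    // Key metrics
    - Heap: Eden, Survivor, and Old Gen usage
    - Non-heap: Metaspace, Code Cache
    - Direct memory: Direct Buffer usage
    - GC: Young GC / Old GC frequency, pause time, throughput

2. Threads

- Total thread count, active thread count
- Deadlock detection
- Thread state distribution (RUNNABLE, BLOCKED, WAITING, and so on)

3. Class loading

- Number of loaded classes
- Number of unloaded classes
- Metaspace usage trend

II. Core API metrics

The queries in this section use the generic metric names http_requests_total and http_request_duration_seconds_bucket. With Micrometer's Prometheus registry in a Spring Boot application, the corresponding series are typically named http_server_requests_seconds_count and http_server_requests_seconds_bucket, so adjust the names to what your application actually exports.

1. QPS (queries per second)

    # Prometheus example
    rate(http_requests_total[5m])

2. Error rate

    # Error rate: share of requests that returned a 5xx status
    sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

3. Response time

- Average response time (avg)
- Percentiles: P50, P90, P95, P99, P999
- Maximum response time

III. Monitoring tool stack

1. JVM monitoring tools

    # Common tool combination
    - JMX: basic monitoring interface
    - Micrometer: metrics collection standard
    - Prometheus + Grafana: monitoring and display
    - Arthas: online diagnostics
    - JProfiler/YourKit: performance profiling

2. APM and end-to-end tracing

    SkyWalking / Pinpoint / Zipkin
    ├── Distributed tracing
    ├── Service topology map
    ├── Slow query analysis
    └── Dependency analysis

IV. Spring Boot implementation example

1. Add the dependencies

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>

2. Sample configuration

    management:
      endpoints:
        web:
          exposure:
            include: health,info,metrics,prometheus
      metrics:
        export:
          prometheus:
            enabled: true
        distribution:
          percentiles-histogram:
            http.server.requests: true
          percentiles:
            http.server.requests: 0.5, 0.9, 0.95, 0.99

3. Custom metrics

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import org.springframework.stereotype.Component;

    @Component
    public class ApiMetrics {

        private final MeterRegistry meterRegistry;
        private final Timer apiTimer;
        private final Counter errorCounter;

        public ApiMetrics(MeterRegistry meterRegistry) {
            this.meterRegistry = meterRegistry;
            this.apiTimer = Timer.builder("api.request.duration")
                    .description("API request duration")
                    .tags("service", "user-service")
                    .register(meterRegistry);
            this.errorCounter = Counter.builder("api.error.count")
                    .description("API error count")
                    .tag("type", "business")
                    .register(meterRegistry);
        }

        // Times the request and counts it as an error if it throws.
        // apiName is accepted but not yet used as a tag.
        public void recordRequest(Runnable request, String apiName) {
            apiTimer.record(() -> {
                try {
                    request.run();
                } catch (RuntimeException e) {
                    errorCounter.increment();
                    throw e;
                }
            });
        }
    }

V. Grafana dashboard configuration

1. JVM dashboard

    - Heap usage trend
    - GC count and pause time
    - Stacked chart of thread states
    - CPU usage

2. API dashboard

The JSON below is a simplified outline of the panels. An actual Grafana dashboard export wraps each query in a target object with an expr field.

    {
      "panels": [
        {
          "title": "QPS",
          "targets": [
            "rate(http_requests_total[1m])"
          ]
        },
        {
          "title": "Response time P99",
          "targets": [
            "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))"
          ]
        },
        {
          "title": "Error rate",
          "targets": [
            "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
          ],
          "format": "percent"
        }
      ]
    }

VI. Alerting rules

1. Prometheus alerting rules

    groups:
      - name: api_alerts
        rules:
          - alert: HighErrorRate
            expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
            for: 2m
            labels:
              severity: warning
            annotations:
              summary: "API error rate is too high"

          - alert: SlowResponse
            expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
            for: 5m
            labels:
              severity: warning

          - alert: JVMHeapHigh
            expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.8
            for: 5m

2. Alert channels

- Email
- WeCom / DingTalk bots
- Slack / Webhook
- SMS / phone call (emergencies)

VII. Best practices

1. Monitoring tiers

    L1 - Core metrics (must be monitored):
      - Service liveness
      - Error rate > 5%
      - P99 response time > 2s

    L2 - Business metrics:
      - QPS of key APIs
      - Distribution of business error codes
      - Database connection pool

    L3 - Deep diagnostics:
      - Slow SQL analysis
      - Thread stacks
      - JVM tuning metrics

2. Capacity planning

    QPS forecast = historical peak * safety factor (1.5-2)
    Memory plan  = heap + non-heap + system reserve
    Thread pool  = (core threads * 2) + buffer queue

3. Dashboard design principles

- Golden signals: latency, traffic, errors, saturation
- Layered views: global → service → instance → method
- Comparative analysis: year-over-year, period-over-period, baseline

VIII. Troubleshooting workflow

    1. Receive the alert
       ↓
    2. Check the dashboards to locate the problem
       ↓
    3. Analyze the logs (ELK)
       ↓
    4. Use diagnostic tools (Arthas)
       ↓
    5. Analyze thread and heap dumps
       ↓
    6. Pinpoint the issue in code
       ↓
    7. Fix and verify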
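The L1 thresholds from the monitoring tiers (error rate above 5%, P99 above 2 s) and the QPS forecast rule from capacity planning can be sketched in plain Java. This is only an illustration under stated assumptions: the class name SloCheck and all method names are hypothetical, and the percentile uses the nearest-rank method over raw samples, whereas Prometheus's histogram_quantile interpolates within histogram buckets, so results will not match exactly.

```java
import java.util.Arrays;

/** Illustrative sketch only; SloCheck and its methods are hypothetical names. */
public class SloCheck {

    /** Fraction of failed requests; returns 0 when there was no traffic. */
    public static double errorRate(long errors, long total) {
        return total == 0 ? 0.0 : (double) errors / total;
    }

    /** Nearest-rank percentile (p in (0, 100]) over latency samples in milliseconds. */
    public static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** L1 rule from the text: breach if error rate > 5% or P99 > 2 s. */
    public static boolean breachesL1(long errors, long total, double[] samplesMs) {
        return errorRate(errors, total) > 0.05 || percentile(samplesMs, 99) > 2000.0;
    }

    /** QPS forecast = historical peak * safety factor (the text suggests 1.5 to 2). */
    public static double forecastQps(double historicalPeakQps, double safetyFactor) {
        if (safetyFactor < 1.0) {
            throw new IllegalArgumentException("safety factor should be at least 1");
        }
        return historicalPeakQps * safetyFactor;
    }

    public static void main(String[] args) {
        double[] latenciesMs = {120, 180, 250, 300, 2500};
        System.out.println("error rate: " + errorRate(3, 100));
        System.out.println("P99 (ms): " + percentile(latenciesMs, 99));
        System.out.println("L1 breach: " + breachesL1(3, 100, latenciesMs));
        System.out.println("forecast QPS: " + forecastQps(1000, 1.5));
    }
}
```

In this sample run the error rate (3%) is below the threshold, but the single 2500 ms sample pushes P99 above 2 s, so the L1 check reports a breach. In a real deployment these checks would normally live in Prometheus alerting rules rather than application code.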
