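I. Scenario
The project is about to be opened to business departments, so the stability of the service has to be ensured. This article shows how to use Prometheus and Grafana to monitor service metrics and to send a DingTalk alert message when a threshold is crossed.
II. Installing the services
1. Deploy Prometheus
1.1 Download the latest release package

1.2 Extract the archive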
tar -zxvf prometheus-3.5.0.linux-amd64.tar.gz
1.3 Start Prometheus (run from the extracted directory)
nohup ./prometheus --config.file=prometheus.yml &

1.4 Open port 9090 in a browser

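2. Deploy Grafana
Pulling the image directly on the server failed, so the image is pulled on a machine with suitable network access and then loaded on the server.
2.1 Pull and save the image on another machine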
docker pull grafana/grafana:latest
docker save -o grafana.tar grafana/grafana:latest
2.2 Copy the archive to the server, load the image, and start the Grafana container
docker load -i grafana.tar
docker run -d --name grafana -p 3000:3000 grafana/grafana:latest
2.3 Log in
- Default username and password: admin / admin
- Default port: 3000
III. Integrating the Spring Boot project with Prometheus
3.1 Add the dependencies
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
3.2 Enable the Prometheus endpoint
management:
  endpoints:
    web:
      exposure:
        include: "prometheus,health" # must include prometheus
      base-path: /actuator # /actuator is already the default
  endpoint:
    prometheus:
      enabled: true # enable the prometheus endpoint
3.3 Allow anonymous access to actuator/prometheus
This service is built on the RuoYi framework, so anonymous access is opened in the following method:
SecurityConfig#filterChain

3.4 Nginx reverse proxy: allow only intranet access to actuator/prometheus
Requests from any IP that is not covered by an allow rule are rejected.
location /actuator/prometheus {
    allow 10.0.0.0/8;
    allow 192.168.0.0/16;
    allow 172.16.0.0/12;
    deny all;
    proxy_pass http://localhost:8080;
}
3.5 Business instrumentation
3.5.1 Defining a custom business metric
1. Keep a cache that maps each tag combination to its Counter.
private final ConcurrentHashMap<String, Counter> counterCache = new ConcurrentHashMap<>();
2. Build the Counter from the event's tags the first time a combination is seen.
Counter counter = counterCache.computeIfAbsent(buildCacheKey(eventDTO), k ->
        Counter.builder("backend_team_custom_event")
                .description("Custom events of the backend team")
                .tag("event", eventDTO.getEvent())
                .tag("event_cn", eventDTO.getEventCn())
                .tag("result", eventDTO.getResultAsString())
                .tag("remark_category", buildCategorizeRemark(eventDTO.getRemark()))
                .tag("env", activeProfile)
                .tag("project", project)
                .register(meterRegistry)
);
3. Increment the counter.
counter.increment();
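The reported event is carried by the following DTO:
import io.swagger.annotations.ApiModel;
import io.swagger.annotations.ApiModelProperty;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;

@Data
@ApiModel("Reported event")
@Builder
@AllArgsConstructor
@NoArgsConstructor
public class AlertEventDTO {
    /**
     * event: event name; event_cn: Chinese name of the event;
     * result: Boolean outcome of the event; remark: free-text remark
     */
    @ApiModelProperty("Event name (English)")
    private String event;
    @ApiModelProperty("Event name (Chinese)")
    private String eventCn;
    @ApiModelProperty("Event result (Boolean)")
    private Boolean result;
    @ApiModelProperty("Event remark")
    private String remark;

    /**
     * Returns the result as a standard string: UNKNOWN, SUCCESS or FAILED.
     */
    public String getResultAsString() {
        if (result == null) {
            return "UNKNOWN";
        }
        return result ? "SUCCESS" : "FAILED";
    }
}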
3.5.2 The instrumentation service
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Component;

@Component
@Slf4j
public class EventMetricService {
    @Autowired
    private MeterRegistry meterRegistry;
    // One Counter per tag combination, created lazily.
    private final ConcurrentHashMap<String, Counter> counterCache = new ConcurrentHashMap<>();
    @Value("${spring.profiles.active:default}")
    private String activeProfile;
    @Value("${spring.application.name}")
    private String project;

    /**
     * Reports an event.
     *
     * @param eventDTO the event to count
     */
    public void reportEvent(AlertEventDTO eventDTO) {
        // 1. Look up (or create) the counter for this tag combination
        Counter counter = counterCache.computeIfAbsent(buildCacheKey(eventDTO), k ->
                Counter.builder("backend_team_custom_event")
                        .description("Custom events of the backend team")
                        .tag("event", eventDTO.getEvent())
                        .tag("event_cn", eventDTO.getEventCn())
                        .tag("result", eventDTO.getResultAsString())
                        .tag("remark_category", buildCategorizeRemark(eventDTO.getRemark()))
                        .tag("env", activeProfile)
                        .tag("project", project)
                        .register(meterRegistry)
        );
        counter.increment();
        // 2. Log the event context (the full remark)
        logRemarkDetail(eventDTO);
    }

    /**
     * Builds the cache key from all tag values.
     *
     * @param alertEventDTO the event
     * @return a key that is unique per tag combination
     */
    private String buildCacheKey(AlertEventDTO alertEventDTO) {
        return String.format("event:%s|event_cn:%s|result:%s|remark_category:%s|env:%s|project:%s",
                alertEventDTO.getEvent(), alertEventDTO.getEventCn(),
                alertEventDTO.getResultAsString(), buildCategorizeRemark(alertEventDTO.getRemark()), activeProfile, project
        );
    }

    /**
     * Maps the free-text remark (the event context) to a category.
     *
     * @param remark the raw remark, may be null
     * @return the first matching keyword, otherwise "long_text" or "other"
     */
    private String buildCategorizeRemark(String remark) {
        if (remark == null) {
            return "null";
        }
        List<String> categoryList = Arrays.asList("Duplicate entry", "foreign key constraint fails", "Data too long", "doesn't have a default value",
                "Deadlock", "Lock wait timeout", "ORA-00904", "invalid identifier", "Table doesn't exist", "Connection refused", "Read timed out",
                "UnknownHost", "404", "Not Found", "500", "401", "403", "NullPointer", "IndexOutOfBounds", "ClassCast", "NumberFormat", "FileNotFound",
                "OutOfMemory", "BeanCreation", "Parameter not found", "Redis connection failed", "IllegalArgumentException", "IllegalStateException", "UnsupportedOperationException",
                "NoSuchMethodException", "NoSuchFieldException", "CloneNotSupportedException", "InterruptedException", "ArithmeticException", "NegativeArraySizeException",
                "SecurityException", "IOException", "EOFException", "SocketException", "SSLException", "SerializationException", "ObjectStreamException", "UnsupportedEncodingException",
                "MalformedURLException", "ZipException", "InvocationTargetException", "InstantiationException", "ClassNotFoundException", "NoClassDefFoundError",
                "UnsatisfiedLinkError", "ExceptionInInitializerError", "ExecutionException", "CancellationException", "RejectedExecutionException",
                "BrokenBarrierException", "TimeoutException", "CompletionException", "DataIntegrityViolationException", "ConcurrencyFailureException",
                "CannotAcquireLockException", "OptimisticLockingFailureException", "PessimisticLockingException", "NonUniqueResultException", "EmptyResultDataAccessException",
                "QueryTimeoutException", "SQLException", "SQLSyntaxErrorException", "SQLIntegrityConstraintViolationException", "HttpMessageNotReadableException",
                "HttpMessageNotWritableException", "HttpMediaTypeNotSupportedException", "HttpMediaTypeNotAcceptableException", "MissingServletRequestParameterException",
                "MethodArgumentNotValidException", "BindException", "NoHandlerFoundException", "AsyncRequestTimeoutException", "WebClientResponseException", "ServerWebInputException",
                "AuthenticationException", "AccessDeniedException", "InsufficientAuthenticationException", "BadCredentialsException", "JWTVerificationException", "SignatureException",
                "InvalidKeyException", "ConfigurationException", "NoSuchBeanDefinitionException", "NoUniqueBeanDefinitionException", "BeanDefinitionStoreException", "BeanInitializationException",
                "CircularDependencyException", "JMSException", "MessageConversionException", "AmqpException", "AmqpTimeoutException", "MessagingException", "AssertionError",
                "UnexpectedTestFailureException");
        // The first keyword found in the remark wins, so list order matters.
        for (String staticRemark : categoryList) {
            if (remark.contains(staticRemark)) {
                return staticRemark;
            }
        }
        if (remark.length() > 50) {
            return "long_text";
        }
        return "other";
    }

    /**
     * Logs the full remark of a reported event.
     *
     * @param eventDTO the event
     */
    private void logRemarkDetail(AlertEventDTO eventDTO) {
        if (eventDTO.getRemark() != null && !eventDTO.getRemark().isEmpty()) {
            log.info("Original text of the reported event - event: {}, remark: {}",
                    eventDTO.getEvent(), eventDTO.getRemark());
        }
    }
}
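The service maps the remark to a category before using it as a tag, presumably because every distinct tag value produces its own time series in Prometheus. The sketch below illustrates the effect with a shortened keyword list. `RemarkCategoryDemo`, `categorize` and `distinctCategories` are hypothetical names that mirror `buildCategorizeRemark`; they are not part of the project.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch: many different remarks collapse into a small set of tag values. */
final class RemarkCategoryDemo {

    // Shortened, illustrative keyword list; the real service checks a much longer one.
    private static final List<String> KEYWORDS =
            Arrays.asList("Duplicate entry", "Deadlock", "Read timed out", "NullPointer");

    /** Same rules as buildCategorizeRemark: first matching keyword, else long_text or other. */
    static String categorize(String remark) {
        if (remark == null) {
            return "null";
        }
        for (String keyword : KEYWORDS) {
            if (remark.contains(keyword)) {
                return keyword;
            }
        }
        return remark.length() > 50 ? "long_text" : "other";
    }

    /** Collects the distinct tag values produced by a batch of remarks. */
    static Set<String> distinctCategories(List<String> remarks) {
        Set<String> categories = new LinkedHashSet<>();
        for (String remark : remarks) {
            categories.add(categorize(remark));
        }
        return categories;
    }

    public static void main(String[] args) {
        List<String> remarks = Arrays.asList(
                "Duplicate entry '1001' for key 'PRIMARY'",
                "Duplicate entry '1002' for key 'PRIMARY'",
                "java.net.SocketTimeoutException: Read timed out",
                "order 42 rejected");
        // Four different remarks end up as three tag values.
        System.out.println(distinctCategories(remarks));
    }
}
```

The full remark is still written to the log by `logRemarkDetail`, so the detail is not lost; it is just kept out of the metric's tags.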
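3.5.3 Usage example
So far, two kinds of events are reported: the results of scheduled tasks, and exceptions (to track exception growth).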
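The article does not show the call sites, so the following is only a minimal sketch of how a scheduled task might report its outcome. The `Event` class stands in for `AlertEventDTO`, the `reporter` callback stands in for `EventMetricService.reportEvent`, and `runAndReport` and the event name `scheduled_job` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch: wrap a job so that its outcome is always reported as an event. */
final class ScheduledJobReporting {

    /** Hypothetical stand-in for AlertEventDTO, reduced to plain fields. */
    static final class Event {
        final String event;
        final Boolean result;
        final String remark;

        Event(String event, Boolean result, String remark) {
            this.event = event;
            this.result = result;
            this.remark = remark;
        }

        @Override
        public String toString() {
            return event + " result=" + result + " remark=" + remark;
        }
    }

    /**
     * Runs the job and reports success or failure through the reporter.
     * In the real project the reporter would be EventMetricService::reportEvent (assumption).
     */
    static boolean runAndReport(String event, Runnable job, Consumer<Event> reporter) {
        try {
            job.run();
            reporter.accept(new Event(event, true, null));
            return true;
        } catch (RuntimeException e) {
            // The exception text becomes the remark; the service later maps it to a category.
            reporter.accept(new Event(event, false, String.valueOf(e)));
            return false;
        }
    }

    public static void main(String[] args) {
        List<Event> reported = new ArrayList<>();
        runAndReport("scheduled_job", () -> { }, reported::add);
        runAndReport("scheduled_job",
                () -> { throw new IllegalStateException("Read timed out"); }, reported::add);
        reported.forEach(System.out::println);
    }
}
```

Reporting in both branches means the `result` tag always separates successful and failed runs, which is what a failure alert can filter on.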

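IV. Scraping the Spring Boot endpoint from Prometheus
Add a scrape job to prometheus-3.5.0.linux-amd64/prometheus.yml that points at your own project's /actuator/prometheus endpoint: under scrape_configs, set metrics_path to /actuator/prometheus and list the application's host:port under targets.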

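V. Configuring the dashboard
5.1 The dashboard after configuration
5.2 Configuring the exception-share pie chart
- backend_team_custom_event_total{env="prod", event="exception", instance="172.24.106.194:8089"} is the standard way to write a Counter metric with labels. It means: the total number of custom events of type "exception" on the instance 172.24.106.194:8089 in the prod environment. The _total suffix is appended to the counter name backend_team_custom_event by Micrometer's Prometheus registry.
- Syntax: <metric_name>{<label_key>="<label_value>", ...}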
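The selector syntax can be illustrated with a small helper that assembles the string from a metric name and a map of labels. `PromSelector.build` is a hypothetical name, not part of any Prometheus or Grafana library, and the sketch only escapes backslashes and double quotes in label values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: build a PromQL selector of the form metric{key="value", ...}. */
final class PromSelector {

    static String build(String metric, Map<String, String> labels) {
        StringJoiner joiner = new StringJoiner(", ", metric + "{", "}");
        for (Map.Entry<String, String> entry : labels.entrySet()) {
            // Label values are double-quoted strings; escape backslashes and quotes.
            String value = entry.getValue().replace("\\", "\\\\").replace("\"", "\\\"");
            joiner.add(entry.getKey() + "=\"" + value + "\"");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("env", "prod");
        labels.put("event", "exception");
        labels.put("instance", "172.24.106.194:8089");
        System.out.println(build("backend_team_custom_event_total", labels));
    }
}
```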

VI. Configuring alert rules
6.1 Alert on excessive exception growth
6.1.1 Query A: number of exceptions added in the last 2 minutes
sum(
increase(backend_team_custom_event_total{project="数智罗盘后端项目",env="prod",event="exception",result="FAILED"}[2m])
) by (project)
6.1.2 Query B: 1.1 times the number of exceptions added in the previous 2-minute period
sum(
increase(backend_team_custom_event_total{project="数智罗盘后端项目",env="prod",event="exception",result="FAILED"}[2m] offset 2m)
)by (project) * 1.1
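6.1.3 Expression C and alert condition D
C: reduce A to its most recent value (A only returns one time series anyway).
D: the alert condition. It fires when the number of exceptions in the last 2 minutes is more than 10% higher than in the previous 2 minutes and more than 40 new exceptions were recorded.
($C > $B) && ($C > 40)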
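The condition can be checked by hand with a few numbers. The sketch below restates it in Java, assuming `current` is the increase over the last 2 minutes (C) and `previous` is the increase over the 2 minutes before that (B before the 1.1 factor). `ExceptionGrowthRule` and `shouldAlert` are hypothetical names; Grafana evaluates the expression itself and exposes no such helper.

```java
/** Sketch: the alert condition ($C > $B) && ($C > 40) written out as plain arithmetic. */
final class ExceptionGrowthRule {

    static final double GROWTH_FACTOR = 1.1;       // "more than 10% growth"
    static final double MIN_NEW_EXCEPTIONS = 40.0; // absolute floor that filters out noise

    static boolean shouldAlert(double current, double previous) {
        double threshold = previous * GROWTH_FACTOR; // this is expression B
        return current > threshold && current > MIN_NEW_EXCEPTIONS;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(60, 50)); // 60 > 55 and 60 > 40
        System.out.println(shouldAlert(45, 50)); // 45 is below 55
        System.out.println(shouldAlert(30, 10)); // tripled, but not above 40
    }
}
```

The absolute floor matters because a relative check alone would fire on tiny counts, for example when the count goes from 1 to 2.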

6.1.4 Notification message and custom notification labels

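VII. Configuring the contact point
The way to configure DingTalk group messages differs slightly between Grafana versions. The version used here is Grafana v11.4.0 (b58701869e).
7.1 Select the DingTalk integration and set the URL
The URL is the webhook address of the DingTalk group robot. The robot requires a security setting; keyword matching is used here, with the keyword 告警 ("alert").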
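With keyword matching, the robot only accepts messages whose text contains the configured keyword. To try the webhook outside Grafana, a payload can be built as sketched below and sent as a JSON POST to the webhook URL. The sketch assumes the robot's plain-text message format (`msgtype` set to `text`); `DingTalkPayload` and its methods are hypothetical names, and the JSON escaping covers only backslashes, double quotes and newlines.

```java
/** Sketch: build a DingTalk text message and enforce the robot's keyword rule locally. */
final class DingTalkPayload {

    // The keyword configured on the robot: 告警, written with Unicode escapes.
    static final String KEYWORD = "\u544a\u8b66";

    static boolean containsKeyword(String content) {
        return content != null && content.contains(KEYWORD);
    }

    /** Minimal JSON string escaping; not a general-purpose encoder. */
    static String escapeJson(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    static String textMessage(String content) {
        if (!containsKeyword(content)) {
            // The robot would reject this message, so fail early.
            throw new IllegalArgumentException("content must contain the keyword " + KEYWORD);
        }
        return "{\"msgtype\":\"text\",\"text\":{\"content\":\"" + escapeJson(content) + "\"}}";
    }

    public static void main(String[] args) {
        System.out.println(textMessage(KEYWORD + ": test message from the monitoring server"));
    }
}
```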
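7.2 Maintain a custom message template
The template text is kept in Chinese: the message must contain the keyword 告警, otherwise the robot rejects it.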
项目: 数据罗盘后端项目
环境: {{ .CommonLabels.env }}
事件: {{ .CommonLabels.event }}
结果: {{ .CommonLabels.result }}
📝 告警摘要: {{ .CommonAnnotations.summary }}
📋 告警描述: {{ .CommonAnnotations.description }}
📊 监控状态
状态: {{ .Status }}
数量: {{ .Alerts.Firing | len }}
🚨 请及时处理!

VIII. Alert results
8.1 Alert on excessive exception growth

8.2 Alert on scheduled-task failure






