Prometheus & Grafana Monitoring Environment Setup

Latest recommended article published on 2026-04-02 09:06:59

This content is licensed under the CC 4.0 BY-SA agreement.

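I. Installing and Configuring Prometheus

1. Download: https://prometheus.io/download/ . The exporters used later are available from the same page.

2. Enter the extracted directory and locate the configuration file prometheus.yml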

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

The snippet above is an example of Prometheus scraping itself; configure the remaining jobs as needed.

3. Start

/home/soft/prometheus/prometheus --config.file="/home/soft/prometheus/prometheus.yml" &

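Adjust the paths above to match your environment.

Prometheus listens on port 9090 by default. Open http://192.168.179.185:9090/ in a browser and choose Status -> Targets from the menu bar to see the scrape targets.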
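The same target list shown under Status -> Targets is also available from the Prometheus HTTP API at /api/v1/targets. The following is a minimal sketch, assuming the server address used in this article; the function names are illustrative and not part of any Prometheus client library.

```python
import json
import urllib.request

# Assumption: Prometheus is reachable at the address used in this article.
PROMETHEUS_URL = "http://192.168.179.185:9090"


def summarize_targets(payload):
    """Return (job, scrape URL, health) tuples from a /api/v1/targets response."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (t["labels"].get("job", ""), t["scrapeUrl"], t["health"])
        for t in active
    ]


def fetch_targets(base_url=PROMETHEUS_URL, timeout=5):
    """Query the HTTP API; this needs a running Prometheus server."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling summarize_targets(fetch_targets()) against a running server should list each job with its scrape URL and a health of "up" or "down".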

4. Alerting configuration

In the Prometheus directory, create a directory named rule and, inside it, a file named node-alerts.yml

/home/soft/prometheus/rule/node-alerts.yml

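Put the following content in the file. For Prometheus to load it, the file must also be listed under rule_files in prometheus.yml (for example rule/node-alerts.yml; relative paths are resolved against the directory that contains prometheus.yml).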

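groups:
- name: Server monitoring
  rules:
  - alert: Instance down
    # expr is the alert condition. `up` is 1 when the last scrape of a target succeeded
    # and 0 when it failed, so the threshold here is up == 0.
    expr: up == 0
    # for is how long the condition must hold before the alert moves from pending to
    # firing and is sent to Alertmanager; shorter blips are never sent. Do not set it
    # below scrape_interval: with scrape_interval 30s and for 15s, a triggered rule is
    # certain to fire, because no new sample has been scraped yet and the old value is
    # reused at the 15s mark. The wait applies only when the alert first becomes active.
    for: 30s
    labels:        # labels attaches extra labels to the alert
      severity: disaster
    annotations:   # annotations are not part of the alert's identity; they carry extra information such as the text shown in notifications
      summary: "Node unreachable"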

  - alert: "Memory usage alert"
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 75
    for: 1m
    labels:
      user: prometheus
      severity: warning
      db: sql
    annotations:
      summary: "Server {{ $labels.instance }}: memory alert"
      description: "{{ $labels.instance }}: memory utilization is above 75% (current value: {{ $value }}%)"
      value: "{{ $value }}"


  - alert: CPU usage alert
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 80
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Server {{ $labels.instance }}: CPU alert"
      description: "Server {{ $labels.instance }}: CPU usage is above 80% (current value: {{ $value }}%)"
      value: "{{ $value }}"


  - alert: Disk usage alert
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Server {{ $labels.instance }}: disk alert"
      description: "Server {{ $labels.instance }}, device {{ $labels.device }}: usage is above 80% (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
      value: "{{ $value }}"


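- name: Database monitoring
  rules:
  - alert: MySQL slow query alert
    expr: rate(mysql_global_status_slow_queries[5m]) > 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }}: slow query alert"
      description: "Server {{ $labels.instance }}: the MySQL slow query rate is above the threshold (more than 10 per second, averaged over 5 minutes); check and optimize."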


  - alert: Too many connections
    expr: mysql_global_status_threads_connected > 100
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Too many connections"
      description: "MySQL has more than 100 connections; handle this promptly."


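  - alert: InnoDB row lock waits
    expr: increase(mysql_global_status_innodb_row_lock_waits[5m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "InnoDB row lock waits detected"
      description: "MySQL transactions have been waiting on row locks (this counter tracks lock waits, not deadlocks); investigate promptly."

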
- name: Microservice monitoring
  rules:
  - alert: Hikari connection alert
    expr: hikaricp_connections{pool="HikariPool"} > 10
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Too many Hikari connections"
      description: "Hikari pool {{ $labels.pool }} has more than 10 connections; monitor and tune it."


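  - alert: Request latency alert
    # The quantile label exists only if percentile publishing is enabled for this timer in Micrometer.
    expr: http_server_requests_seconds{quantile="0.95"} > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "HTTP request handling is slow"
      description: "95th percentile HTTP request handling time is above 1 second (current value: {{ $value }})"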


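  - alert: Abnormal request rate alert
    # The status label holds the numeric code (for example 200), so the 2xx class is matched with a regex.
    expr: rate(http_server_requests_seconds_count{method="GET", status=~"2.."}[1m]) > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Abnormal endpoint request rate"
      description: "Successful GET requests exceed 100 per second; check the endpoint's performance."

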
  - alert: Heap memory usage alert
    expr: jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Heap memory usage alert"
      description: "JVM heap memory usage is above 80%."

Restart Prometheus, open http://192.168.179.185:9090/ in a browser, and choose Alerts from the menu bar to see the rules and their current state.
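The Alerts page shows each rule as inactive, pending, or firing, and the `for` field decides when pending becomes firing. The sketch below is a simplified model of that behaviour, not Prometheus code: it assumes one boolean per rule evaluation and a hypothetical 15s evaluation interval, and it ignores other rule options.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval holds one boolean per evaluation: True when the alert
    expression returned a result (for example up == 0 matched a target).
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # The condition cleared, so the pending timer is reset.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# for: 30s with a 15s evaluation interval (the interval is an assumption).
print(alert_states([False, True, True, True, False, True], 15, 30))
```

A condition that is true for a single evaluation and then clears never leaves the pending state, which is why short outages do not reach Alertmanager.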

5. Start/stop scripts

startup.sh


#!/bin/bash
/home/soft/prometheus/prometheus --config.file="/home/soft/prometheus/prometheus.yml" > /dev/null 2>&1 &

stop.sh

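#!/bin/bash

service_name=prometheus

# -x matches the exact process name, so exporters and this script itself
# (whose path may contain "prometheus") are not selected.
pids=$(pgrep -x "${service_name}")

if [ -n "${pids}" ]
then
    echo "Stopping ${service_name}..."
    # Unquoted on purpose: pgrep may return several PIDs, one per line.
    kill -15 ${pids}
    echo "Stop signal sent to ${service_name}."
else
    echo "${service_name} is not running."
fi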

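II. Installing and Configuring Grafana

Download: https://grafana.com/grafana/download

This one is simple: extract the archive, enter the bin directory, and start the server directly. Then open http://192.168.179.185:3000/ in a browser; the default username and password are both admin.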

nohup ./grafana-server > /dev/null 2>&1 &

Configure the Prometheus data source

Leave the other settings at their defaults and save.

Import dashboards for the exporters

III. Using Exporters

This section covers two common exporters, node_exporter and mysqld_exporter.

1. node_exporter configuration

Enter the extracted directory and start it directly:

/home/soft/exporter/node_exporter/node_exporter &

Add a job to Prometheus by editing /home/soft/prometheus/prometheus.yml, then restart Prometheus.

(To monitor several machines, start node_exporter on each server and add the corresponding job.)

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node_exporter"
    static_configs:
      - targets: ["192.168.179.185:9100"]
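When many servers run node_exporter, one alternative to a separate job per server is a single job that lists every host under targets. The helper below is a hypothetical sketch that renders such a job as text to paste under scrape_configs; the second host address is invented for illustration.

```python
def node_exporter_job(hosts, port=9100, job_name="node_exporter"):
    """Render one scrape job whose static_configs lists every host."""
    targets = ", ".join(f'"{host}:{port}"' for host in hosts)
    return (
        f'  - job_name: "{job_name}"\n'
        f"    static_configs:\n"
        f"      - targets: [{targets}]\n"
    )


# The second address is a made-up example, not a host from this article.
hosts = ["192.168.179.185", "192.168.179.186"]
print(node_exporter_job(hosts))
```

Each target still gets its own instance label, so alert rules that group by instance work the same way with either layout.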

2. mysqld_exporter configuration

In the extracted directory, configure the my.cnf file; create it if it does not exist.

[client]
port=3306
user=exporter
password=123456
host=192.168.170.243

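Start mysqld_exporter. Note that the port configured here is 9104. (To monitor several MySQL servers, start several instances listening on different ports and add the corresponding jobs.)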

/home/soft/exporter/mysqld_exporter/mysqld_exporter --web.listen-address=:9104 --config.my-cnf=/home/soft/exporter/mysqld_exporter/my.cnf &

Add a job to Prometheus by editing /home/soft/prometheus/prometheus.yml, then restart Prometheus.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]


  - job_name: "mysql_exporter"
    static_configs:
      - targets: ["192.168.179.185:9104"]


  - job_name: "node_exporter"
    static_configs:
      - targets: ["192.168.179.185:9100"]

3. Monitoring a microservice

Add the dependencies

<!-- Enable Spring Boot application monitoring -->
<dependency>
	<groupId>org.springframework.boot</groupId>
	<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Add the Prometheus integration -->
<dependency>
	<groupId>io.micrometer</groupId>
	<artifactId>micrometer-registry-prometheus</artifactId>
	<version>1.9.0</version>
</dependency>

Edit the application yml

# Expose the Spring Boot Actuator endpoints, including /actuator/prometheus
management:
  endpoints:
    web:
      exposure:
        include: '*'
  endpoint:
    prometheus:
      enabled: true
    health:
      show-details: always

Add a job to Prometheus by editing /home/soft/prometheus/prometheus.yml, then restart Prometheus.

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "jb_service"
    scrape_interval: 1s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ["192.168.179.185:28085"]
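Before relying on the new job, it can help to confirm that /actuator/prometheus returns the metrics the alert rules expect. The following is a minimal sketch of a parser for the text format that endpoint returns; it ignores timestamps and escaped characters, and the function names are illustrative rather than taken from any client library.

```python
import re

# Metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def heap_usage_ratio(samples):
    """Heap bytes used divided by the heap limit, skipping pools with no limit (-1)."""
    used = sum(v for n, l, v in samples
               if n == "jvm_memory_used_bytes" and l.get("area") == "heap")
    limit = sum(v for n, l, v in samples
                if n == "jvm_memory_max_bytes" and l.get("area") == "heap" and v > 0)
    return used / limit if limit else None
```

Feeding the body of the endpoint response to parse_metrics and then to heap_usage_ratio gives a quick view of the value that the heap memory alert compares against 0.8.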