Result preview 1
Result preview 2
WeChat alert

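  1. Resource preparation
  • Monitoring host: Ubuntu 20.04, AMD EPYC 7551, 1 core, 1 GB RAM (8.210.17.226)
  • Monitored host: Debian 10, ARM, 2 cores, 12 GB RAM (47.242.196.53)
  2. Before you install
  • 2.1 Services to install on the monitoring host: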
    • Prometheus
    • Node Exporter
    • Grafana
    • Alertmanager
    • webhook-adapter
  • 2.2 Services to install on the monitored host:
    • Node Exporter
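  • 2.3 Linux system settings that affect these services:
    • Time synchronization
    • Firewall and SELinux
    • Every service here runs in a container, so Docker must be installed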
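The system checks in 2.3 can be sketched as two small shell helpers. This is a minimal sketch, assuming systemd's timedatectl is available and that ufw is the firewall in use. The function names are made up for illustration, and other firewalls (firewalld, iptables, cloud security groups) need their own equivalent rule.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of the original setup.

# Prints "yes" when systemd reports the clock as NTP-synchronized.
check_time_sync() {
    timedatectl show -p NTPSynchronized --value
}

# Assumption: ufw is the firewall. Lets only the monitoring host
# (first argument) reach Node Exporter on port 9100.
allow_scrape_from() {
    ufw allow from "$1" to any port 9100 proto tcp
}

# Usage on the monitored host (run as root):
#   check_time_sync
#   allow_scrape_from 8.210.17.226
```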
  3. Install Docker
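The original gives no commands for this step. One possible way, sketched here under the assumption that Docker's convenience script from get.docker.com is acceptable on both hosts (distribution packages are an alternative), is:

```shell
#!/bin/sh
# Hypothetical helper; assumes the get.docker.com convenience script is acceptable.
install_docker() {
    curl -fsSL https://get.docker.com -o get-docker.sh &&
    sh get-docker.sh &&
    systemctl enable --now docker   # start Docker now and on every boot
}

# Usage (run as root on both hosts):
#   install_docker
#   docker version
```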
  4. Deploy the components
  • 4.1 Install Node Exporter
    • Pull the image: docker pull prom/node-exporter:latest
    • Create the start script: vi node-export-start.sh
docker run -d -p 9100:9100 \
-v "/proc:/host/proc" \
-v "/sys:/host/sys" \
-v "/:/rootfs" \
-v "/etc/localtime:/etc/localtime" \
prom/node-exporter \
--path.procfs /host/proc \
--path.sysfs /host/sys \
--collector.filesystem.mount-points-exclude "^/(sys|proc|dev|host|etc)($|/)"
    • Start Node Exporter: chmod +x node-export-start.sh && ./node-export-start.sh
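To confirm that Node Exporter is actually serving metrics, a quick check along these lines may help. It is a sketch that assumes curl is installed; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: counts the lines that start with "node_"
# in exposition text read from standard input.
count_node_metrics() {
    grep -c '^node_'
}

# Fetches /metrics from a Node Exporter (host defaults to localhost)
# and prints how many node_* samples it returned.
check_node_exporter() {
    curl -fsS "http://${1:-localhost}:9100/metrics" | count_node_metrics
}

# Usage:
#   check_node_exporter                 # on the host itself
#   check_node_exporter 47.242.196.53   # from the monitoring host
```

A count of zero, or a curl error, suggests the container is not running or port 9100 is blocked.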
  • 4.2 Install Prometheus
    • Pull the image: docker pull prom/prometheus
    • Create the start script: vi prometheus-start.sh
docker run -d \
    -p 9090:9090 \
    -v /home/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /home/docker/prometheus/rules:/etc/prometheus/rules \
    prom/prometheus
    • Run in a terminal:
mkdir -p /home/docker/prometheus/rules
vi /home/docker/prometheus/prometheus.yml
    • Paste in the configuration:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 8.210.17.226:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  # Jobs to scrape
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          serviceId: prometheus
          serviceName: Prometheus
  - job_name: "node_exporter"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['10.0.0.240:9100','47.242.196.53:9100']
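Before starting the container it can be worth validating the file with promtool, which ships in the prom/prometheus image. The sketch below reuses the volume paths from prometheus-start.sh; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: runs "promtool check config" in a throwaway container,
# using the same mounts as prometheus-start.sh.
check_prometheus_config() {
    docker run --rm --entrypoint promtool \
        -v /home/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
        -v /home/docker/prometheus/rules:/etc/prometheus/rules \
        prom/prometheus check config /etc/prometheus/prometheus.yml
}

# Usage:
#   check_prometheus_config && ./prometheus-start.sh
```

Because the rule files are referenced from prometheus.yml, the same check also parses everything under the rules directory.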
  • 4.3 Install Grafana
    • Pull the image: docker pull grafana/grafana:latest
    • Create the start script: vi grafana-start.sh
docker run -d -i -p 3000:3000 \
-v "/etc/localtime:/etc/localtime" \
-e "GF_SERVER_ROOT_URL=http://grafana.server.name" \
-e "GF_SECURITY_ADMIN_PASSWORD=studygolang" \
grafana/grafana
  • 4.4 Install Alertmanager
    • Pull the image: docker pull prom/alertmanager:latest
    • Start the service:
docker run --name alertmanager -d -p 9093:9093 --restart=always \
prom/alertmanager
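    • Copy the configuration out of the container: docker cp alertmanager:/etc/alertmanager /home/docker
    • Remove the container (docker rm -f alertmanager), then create the start script: vi alertmanager-start.sh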
docker run --name alertmanager -d -p 9093:9093 --restart=always \
-v /home/docker/alertmanager/:/etc/alertmanager/ \
prom/alertmanager
  • 4.5 Install webhook-adapter
    • Pull the image: docker pull guyongquan/webhook-adapter:latest
    • Create the start script: vi webhook-adapter-start.sh
docker run --name webhook-adapter -p 8080:80 -d guyongquan/webhook-adapter --adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=(WeCom-group-robot-key)
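If no message arrives later, it helps to rule out the robot key first by posting to the WeCom webhook directly, without the adapter in between. This is a minimal sketch assuming curl is available; the function names are hypothetical, and the payload is the plain-text message form of the group robot API.

```shell
#!/bin/sh
# Hypothetical helper: builds the JSON body for a plain-text message.
# No escaping is done, so the text must not contain double quotes or backslashes.
wecom_text_payload() {
    printf '{"msgtype":"text","text":{"content":"%s"}}' "$1"
}

# Sends a test message straight to the group robot.
# $1 is the robot key, the same value used in webhook-adapter-start.sh.
send_wecom_test() {
    curl -fsS -H 'Content-Type: application/json' \
        -d "$(wecom_text_payload 'webhook test')" \
        "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=$1"
}

# Usage:
#   send_wecom_test (WeCom-group-robot-key)
```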
  5. Using Grafana
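The original leaves this step without detail. A usual first step is to register Prometheus as a data source, which can be done in the web UI or, as sketched here, through Grafana's HTTP API. The sketch assumes the admin password set in grafana-start.sh and a Prometheus URL that is reachable from inside the Grafana container; the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: JSON body that registers Prometheus ($1 = URL)
# as the default data source.
prometheus_datasource_payload() {
    printf '{"name":"Prometheus","type":"prometheus","url":"%s","access":"proxy","isDefault":true}' "$1"
}

# Posts the data source to Grafana's HTTP API on the monitoring host.
# $1 = admin password, $2 = Prometheus URL as seen from the Grafana container.
add_prometheus_datasource() {
    curl -fsS -u "admin:$1" -H 'Content-Type: application/json' \
        -d "$(prometheus_datasource_payload "$2")" \
        http://localhost:3000/api/datasources
}

# Usage:
#   add_prometheus_datasource studygolang http://8.210.17.226:9090
```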
  6. Update the Prometheus and Alertmanager configuration
  • 6.1 Edit the Prometheus configuration file prometheus.yml
    • Run in a terminal:
cd /home/docker/prometheus
vim prometheus.yml
    • Edit the configuration:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 8.210.17.226:9093

rule_files:
  - "rules/*.yml" # the rules directory must be mounted when starting Prometheus, otherwise these files cannot be read

scrape_configs:
  # Jobs to scrape
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          serviceId: prometheus
          serviceName: Prometheus
  - job_name: "cloudcone"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['8.210.17.226:9100']
  - job_name: "oracle_SanJose_ARM"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['47.242.196.53:9100']
  • 6.2 Create rule files in the Prometheus rules directory
    • memory_over.yml
groups:
- name: Memory alert rules
  rules:
  - alert: Memory usage alert
    expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "The server is running low on available memory."
      description: "Memory usage is above 80% (current value: {{ $value }}%)"
    • disk_over.yml
groups:
- name: Disk usage alert rules
  rules:
  - alert: Disk usage alert
    expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: "Disk partition usage is too high"
      description: "Partition usage is above 80% (current value: {{ $value }}%)"
    • cpu_over.yml
groups:
- name: CPU alert rules
  rules:
  - alert: CPU usage alert
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage is spiking."
      description: "CPU usage is above 50% (current value: {{ $value }}%)"
    • node_alived.yml
groups:
- name: Instance liveness alert rules
  rules:
  - alert: Instance down alert
    expr: up == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Host is down!"
      description: "This instance has been down for more than one minute."
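The memory rule boils down to simple arithmetic on two fields of /proc/meminfo, and the rule files can be checked with promtool before Prometheus loads them. The following sketch shows both; the function names are hypothetical, and check_rules assumes the rules directory is mounted at /etc/prometheus/rules as in prometheus-start.sh.

```shell
#!/bin/sh
# Hypothetical helper: computes the same percentage as the memory rule,
# (1 - MemAvailable / MemTotal) * 100, from a meminfo-style file.
mem_used_pct() {
    awk '/^MemTotal:/ {t = $2} /^MemAvailable:/ {a = $2}
         END {printf "%.1f\n", (1 - a / t) * 100}' "${1:-/proc/meminfo}"
}

# Validates every rule file with promtool inside the running
# Prometheus container ($1 = container name or ID from "docker ps").
check_rules() {
    docker exec "$1" sh -c 'promtool check rules /etc/prometheus/rules/*.yml'
}

# Usage:
#   mem_used_pct              # current usage of this host, in percent
#   check_rules <container>
```

For example, MemTotal of 1000 kB with MemAvailable of 250 kB gives 75.0, which stays below the 80 threshold in memory_over.yml.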
  • 6.3 Edit the Alertmanager configuration file alertmanager.yml
    • Run in a terminal:
cd /home/docker/alertmanager
vim alertmanager.yml
    • alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - send_resolved: true
    url: 'http://8.210.17.226:8080/adapter/wx'    
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
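After restarting Alertmanager with this file, the whole path to WeChat can be exercised without waiting for a real incident. The sketch below validates the configuration with amtool, which ships in the prom/alertmanager image, and then posts a synthetic alert to Alertmanager's v2 API. It assumes the container is named alertmanager as in alertmanager-start.sh; the function names and the alert labels are hypothetical.

```shell
#!/bin/sh
# Validates the mounted configuration inside the running container.
check_alertmanager_config() {
    docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml
}

# Hypothetical helper: JSON body for one synthetic alert ($1 = alert name).
test_alert_payload() {
    printf '[{"labels":{"alertname":"%s","severity":"warning","instance":"test"},"annotations":{"summary":"test alert"}}]' "$1"
}

# Posts the synthetic alert, so that it travels through the web.hook
# receiver to the adapter and on to the group robot.
send_test_alert() {
    curl -fsS -H 'Content-Type: application/json' \
        -d "$(test_alert_payload "$1")" \
        http://localhost:9093/api/v2/alerts
}

# Usage on the monitoring host:
#   check_alertmanager_config && send_test_alert TestAlert
```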
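  7. To be improved
  • When a new machine needs to be monitored, the service must be restarted after the configuration on the monitoring host is updated
  • Container-level metrics are not collected (cAdvisor can do this; not set up for now)
  • The monitoring host is under some resource pressure, with memory usage at 70% or more
  • Grafana is not served over HTTPS (not much use here)
  • Push server fault alerts to WeChat
  • Other issues yet to be discovered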
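On the first item: Prometheus re-reads its configuration and rule files when it receives SIGHUP, so a full restart may not always be needed. A minimal sketch, with a hypothetical function name and the container passed in explicitly because prometheus-start.sh does not name it:

```shell
#!/bin/sh
# Hypothetical helper: asks Prometheus to reload prometheus.yml and the
# rule files by sending SIGHUP to its container
# ($1 = container name or ID, as shown by "docker ps").
reload_prometheus() {
    docker kill --signal=HUP "$1"
}

# Usage:
#   reload_prometheus <container>
```

One caveat: prometheus.yml is bind-mounted as a single file, and editors that replace the file on save can leave the container looking at the old copy. In that case a restart is still required, while changes under the mounted rules directory are picked up by the reload.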
