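- 1 Resources
- Monitoring server: Ubuntu 20.04, AMD EPYC 7551, 1 core / 1 GB RAM (8.210.17.226)
- Monitored host: Debian 10, ARM, 2 cores / 12 GB RAM (47.242.196.53)
- 2 Before you install
- 2.1 Services to install on the monitoring server:
- Prometheus
- Node Exporter
- Grafana
- Alertmanager
- webhook-adapter
- 2.2 Services to install on the monitored host:
- Node Exporter
- 2.3 Linux system settings that affect these services:
- Time synchronization
- Firewall and SELinux
- All services in this guide run in containers, so Docker must be installed
- 3 Install Docker
- 4 Deploy the components
- 4.1 Install Node Exporter
- Pull the image: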
docker pull prom/node-exporter:latest
- Create the start script:
vi node-export-start.sh
- Script contents:
docker run -d -p 9100:9100 \
-v "/proc:/host/proc" \
-v "/sys:/host/sys" \
-v "/:/rootfs" \
-v "/etc/localtime:/etc/localtime" \
prom/node-exporter \
--path.procfs /host/proc \
--path.sysfs /host/sys \
--collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"
- Start Node Exporter (make the script executable first):
chmod +x node-export-start.sh
./node-export-start.sh
- Verify that Node Exporter is running: open http://8.210.17.226:9100/metrics
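The check above can also be scripted instead of opening the URL in a browser. The following is a minimal sketch, assuming curl and grep are available on the machine running the check; count_node_metrics is a hypothetical helper name, not part of Node Exporter. It counts the sample lines whose names start with node_, so a non-zero result suggests the exporter is serving host metrics.

```shell
# Count Node Exporter samples (lines starting with "node_") in metrics text read from stdin.
# Comment lines such as "# HELP ..." and "# TYPE ..." are not counted.
count_node_metrics() {
  grep -c '^node_'
}

# Usage (address as used in this guide):
#   curl -s http://8.210.17.226:9100/metrics | count_node_metrics
```

The same helper can be pointed at the monitored host by swapping in its address.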
- 4.2 Install Prometheus
- Pull the image:
docker pull prom/prometheus
- Create the start script:
vi prometheus-start.sh
- Script contents:
docker run -d \
-p 9090:9090 \
-v /home/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /home/docker/prometheus/rules:/etc/prometheus/rules \
prom/prometheus
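- In a terminal, create the configuration and rules directories, then open the configuration file:
mkdir -p /home/docker/prometheus/rules
vi /home/docker/prometheus/prometheus.yml
- Paste the following configuration: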
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 8.210.17.226:9093
rule_files:
  - "rules/*.yml"
scrape_configs:
  # Jobs to scrape
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          serviceId: prometheus
          serviceName: Prometheus
  - job_name: "node_exporter"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['10.0.0.240:9100','47.242.196.53:9100']
- Start Prometheus (make the script executable first):
chmod +x prometheus-start.sh
./prometheus-start.sh
- Verify that Prometheus is running: open http://8.210.17.226:9090/targets
- 4.3 Install Grafana
- Pull the image:
docker pull grafana/grafana:latest
- Create the start script:
vi grafana-start.sh
- Script contents:
docker run -d -i -p 3000:3000 \
-v "/etc/localtime:/etc/localtime" \
-e "GF_SERVER_ROOT_URL=http://grafana.server.name" \
-e "GF_SECURITY_ADMIN_PASSWORD=studygolang" \
grafana/grafana
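- Start Grafana (make the script executable first):
chmod +x grafana-start.sh
./grafana-start.sh
- Verify that Grafana is running: open http://8.210.17.226:3000/metrics or http://8.210.17.226:3000 and log in with username admin and password studygolang
- 4.4 Install Alertmanager
- Pull the image: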
docker pull prom/alertmanager:latest
- Start the service once to obtain the default configuration:
docker run --name alertmanager -d -p 9093:9093 --restart=always \
prom/alertmanager
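- Copy the configuration directory out of the container:
docker cp alertmanager:/etc/alertmanager /home/docker
- Remove the temporary container, then create the start script:
docker rm -f alertmanager
vi alertmanager-start.sh
- Script contents: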
docker run --name alertmanager -d -p 9093:9093 --restart=always \
-v /home/docker/alertmanager/:/etc/alertmanager/ \
prom/alertmanager
- 4.5 Install webhook-adapter
- Pull the image:
docker pull guyongquan/webhook-adapter:latest
- Create the start script:
vi webhook-adapter-start.sh
- Script contents (replace the placeholder with the key of your WeCom group robot):
docker run --name webhook-adapter -p 8080:80 -d guyongquan/webhook-adapter "--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<WeCom group robot key>"
- 5 Using Grafana
- 5.1 Add the Prometheus data source
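Besides the web UI, a data source can be created through Grafana's HTTP API. The following is a minimal sketch, assuming the admin credentials and addresses used earlier in this guide; prometheus_datasource_json is a hypothetical helper, and the data source name "Prometheus" is an arbitrary choice.

```shell
# Print the JSON body for a Prometheus data source; $1 is the Prometheus URL.
prometheus_datasource_json() {
  printf '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"%s","isDefault":true}\n' "$1"
}

# Usage: pipe the body to Grafana's data source endpoint.
#   prometheus_datasource_json http://8.210.17.226:9090 |
#     curl -s -u admin:studygolang -H 'Content-Type: application/json' \
#       -X POST -d @- http://8.210.17.226:3000/api/datasources
```

Keeping the payload in a function makes it easy to inspect before anything is sent to Grafana.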
- 6 Update the Prometheus and Alertmanager configuration
- 6.1 Edit the Prometheus configuration file prometheus.yml
- Run in a terminal:
cd /home/docker/prometheus
vim prometheus.yml
- Updated configuration:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 8.210.17.226:9093
rule_files:
  - "rules/*.yml" # the rules directory must be mounted into the Prometheus container, otherwise these files cannot be read
scrape_configs:
  # Jobs to scrape
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          serviceId: prometheus
          serviceName: Prometheus
  - job_name: "cloudcone"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['8.210.17.226:9100']
  - job_name: "oracle_SanJose_ARM"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['47.242.196.53:9100']
- 6.2 Create the following rule files in the Prometheus rules directory (/home/docker/prometheus/rules)
memory_over.yml:
groups:
  - name: Memory alert rules
    rules:
      - alert: Memory usage alert
        expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Available memory on the server is running low."
          description: "Memory usage has exceeded 80% (current value: {{ $value }}%)"
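To make the memory expression concrete, the same arithmetic can be reproduced outside Prometheus. This is a minimal sketch, assuming awk is available; mem_used_pct is a hypothetical helper and the byte values are made-up sample inputs, not measurements from the servers in this guide.

```shell
# Memory usage in percent, computed as (1 - MemAvailable / MemTotal) * 100.
# $1 is MemAvailable in bytes, $2 is MemTotal in bytes.
mem_used_pct() {
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# Example: 128 MiB available out of 1 GiB total.
mem_used_pct 134217728 1073741824   # prints 87.5
```

A value of 87.5 is above the threshold of 80, so the rule would fire once the condition has held for the 1m given in for.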
disk_over.yml:
groups:
  - name: Disk usage alert rules
    rules:
      - alert: Disk usage alert
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
        for: 20m
        labels:
          severity: warning
        annotations:
          summary: "Disk partition usage is too high"
          description: "Partition usage is above 80% (current value: {{ $value }}%)"
cpu_over.yml:
groups:
  - name: CPU alert rules
    rules:
      - alert: CPU usage alert
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is spiking."
          description: "CPU usage is above 50% (current value: {{ $value }}%)"
node_alived.yml:
groups:
  - name: Instance liveness alert rules
    rules:
      - alert: Instance down alert
        expr: up == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Host is down!"
          description: "This instance has been down for more than one minute."
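Before Prometheus loads the four rule files above, they can be syntax-checked with promtool, which ships in the prom/prometheus image. A minimal sketch, assuming the files live in the host directory used in this guide; check_rules is a hypothetical helper, and the /rules mount point inside the container is an arbitrary choice.

```shell
# Validate the four rule files with promtool from the prom/prometheus image.
# $1 is the host directory that holds the rule files; it is mounted read-only.
check_rules() {
  docker run --rm --entrypoint promtool \
    -v "$1:/rules:ro" \
    prom/prometheus \
    check rules /rules/memory_over.yml /rules/disk_over.yml /rules/cpu_over.yml /rules/node_alived.yml
}

# Usage:
#   check_rules /home/docker/prometheus/rules
```

The files are listed explicitly because a wildcard would not be expanded inside the container without a shell.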
- 6.3 Edit the Alertmanager configuration file alertmanager.yml
- Run in a terminal:
cd /home/docker/alertmanager
vim alertmanager.yml
alertmanager.yml:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - send_resolved: true
        url: 'http://8.210.17.226:8080/adapter/wx'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
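To exercise the route and receiver above end to end, a test alert can be posted to Alertmanager by hand. This is a minimal sketch, assuming Alertmanager is reachable at the address used in this guide and serves its v2 API; test_alert_json and the alert name PipelineTest are hypothetical.

```shell
# Print a one-alert JSON array in the shape accepted by Alertmanager's v2 alerts endpoint.
# $1 is the alert name; the other labels and the summary are fixed sample values.
test_alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"warning","instance":"test"},"annotations":{"summary":"test alert"}}]\n' "$1"
}

# Usage:
#   test_alert_json PipelineTest |
#     curl -s -H 'Content-Type: application/json' -X POST -d @- \
#       http://8.210.17.226:9093/api/v2/alerts
```

If the chain works, a message should reach the WeCom group shortly after the group_wait of 10s has passed.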
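- 7 To do
- When a new machine is added, the monitoring server's configuration must be updated and the service restarted
- Metrics from inside containers are not collected (cAdvisor could do this; not set up for now)
- The monitoring server is short on resources: memory usage stays at 70% or higher
- Grafana is not served over HTTPS (low priority)
- Push server failure alerts to WeChat
- Other issues yet to be discovered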
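On the restart requirement listed above: Prometheus also re-reads its configuration and rule files when its process receives SIGHUP, which may avoid a full restart. This is a minimal sketch, assuming the Prometheus container has a name or that its ID is taken from docker ps (the start script in this guide does not set a name), and that the edited prometheus.yml is visible inside the container; reload_prometheus is a hypothetical helper.

```shell
# Ask a running Prometheus container to reload its configuration by sending SIGHUP.
# $1 is the container name or ID.
reload_prometheus() {
  docker kill --signal=HUP "$1"
}

# Usage (container name is an assumption):
#   reload_prometheus prometheus
```

Checking http://8.210.17.226:9090/targets afterwards shows whether the new targets were picked up.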