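Monitoring and Alerting
How does Pigsty monitor its own infrastructure?
This page describes the dashboards and alert rules that Pigsty provides for the INFRA module.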
Dashboards
Pigsty provides the following dashboards for the INFRA module:
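| Dashboard | Description |
|---|---|
| Pigsty Home | Home page of the Pigsty monitoring system |
| INFRA Overview | Overview of Pigsty infrastructure self-monitoring |
| Nginx Instance | Nginx metrics and logs |
| Grafana Instance | Grafana metrics and logs |
| VictoriaMetrics Instance | VictoriaMetrics scrape and query status |
| VMAlert Instance | Alert rule evaluation status |
| Alertmanager Instance | Alert aggregation and notification |
| VictoriaLogs Instance | Log ingestion, querying, and indexing |
| Logs Instance | Logs from a single node |
| VictoriaTraces Instance | Trace storage and querying |
| Inventory CMDB | CMDB visualization |
| ETCD Overview | etcd cluster monitoring |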
Alert Rules
Pigsty provides the following two alert rules for the INFRA module:
| Alert Rule | Description |
|---|---|
| InfraDown | An infrastructure component is down |
| AgentDown | A monitoring agent is down |
You can modify these rules or add new infrastructure alert rules in files/victoria/rules/infra.yml.
Alert Rule Configuration
################################################################
#                Infrastructure Alert Rules                    #
################################################################
- name: infra-alert
  rules:

    #==============================================================#
    #                       Infra Aliveness                        #
    #==============================================================#
    # infra components (victoria,grafana) down for 1m triggers a P1 alert
    - alert: InfraDown
      expr: infra_up < 1
      for: 1m
      labels: { level: 0, severity: CRIT, category: infra }
      annotations:
        summary: "CRIT InfraDown {{ $labels.type }}@{{ $labels.instance }}"
        description: |
          infra_up[type={{ $labels.type }}, instance={{ $labels.instance }}] = {{ $value | printf "%.2f" }} < 1

    #==============================================================#
    #                       Agent Aliveness                        #
    #==============================================================#
    # agent aliveness are determined directly by exporter aliveness
    # including: node_exporter, pg_exporter, pgbouncer_exporter, haproxy_exporter
    - alert: AgentDown
      expr: agent_up < 1
      for: 1m
      labels: { level: 0, severity: CRIT, category: infra }
      annotations:
        summary: 'CRIT AgentDown {{ $labels.ins }}@{{ $labels.instance }}'
        description: |
          agent_up[ins={{ $labels.ins }}, instance={{ $labels.instance }}] = {{ $value | printf "%.2f" }} < 1
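
A new rule added to files/victoria/rules/infra.yml can follow the same shape as the two rules above. The fragment below is only a sketch and is not part of the shipped file. The rule name `InfraAbsent`, the 5m duration, and the `level: 1` / `severity: WARN` pairing are all assumptions made for illustration. It would be appended under the existing `rules:` list, and it covers a case that `infra_up < 1` cannot see: when no `infra_up` series exists at all, there is no sample to compare against 1.

```yaml
    # Hypothetical rule, not part of the shipped infra.yml.
    # absent() yields a value only when no infra_up series exists at all,
    # so this fires when the metric has disappeared entirely.
    - alert: InfraAbsent            # assumed rule name
      expr: absent(infra_up)
      for: 5m                       # assumed duration
      labels: { level: 1, severity: WARN, category: infra }   # assumed WARN / level 1 pairing
      annotations:
        summary: "WARN InfraAbsent"
        description: |
          no infra_up series has been reported for 5m
```

Because the result of `absent(infra_up)` carries no `type` or `instance` labels, the annotations here deliberately avoid the `{{ $labels.* }}` templates used by InfraDown and AgentDown.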