Monitoring
Etcd monitoring metrics, dashboards, and alerting rules
Dashboard
The ETCD module provides one monitoring dashboard: ETCD Overview.
ETCD Overview Dashboard
ETCD Overview: Overview of the ETCD cluster
This dashboard shows key information about ETCD status. The most notable item is ETCD Aliveness, which displays the overall service status of the ETCD cluster.
Red bands indicate periods when instances are unavailable, while the blue-gray bands below show when the entire cluster is unavailable.
Alert Rules
Pigsty provides the following five preset alert rules for Etcd, defined in files/prometheus/rules/etcd.yml:
- EtcdServerDown: Etcd node down, critical alert
- EtcdNoLeader: Etcd cluster has no leader, critical alert
- EtcdQuotaFull: Etcd quota usage exceeds 90%, warning
- EtcdNetworkPeerRTSlow: Etcd network latency is slow, notice
- EtcdWalFsyncSlow: Etcd disk fsync is slow, notice
#==============================================================#
# Aliveness #
#==============================================================#
# etcd server instance down
- alert: EtcdServerDown
  expr: etcd_up < 1
  for: 1m
  labels: { level: 0, severity: CRIT, category: etcd }
  annotations:
    summary: "CRIT EtcdServerDown {{ $labels.ins }}@{{ $labels.instance }}"
    description: |
      etcd_up[ins={{ $labels.ins }}, instance={{ $labels.instance }}] = {{ $value }} < 1
      http://g.pigsty/d/etcd-overview
#==============================================================#
# Error #
#==============================================================#
# Etcd no Leader triggers a P0 alert immediately
# if dcs_failsafe mode is not enabled, this may lead to global outage
- alert: EtcdNoLeader
  expr: min(etcd_server_has_leader) by (cls) < 1
  for: 15s
  labels: { level: 0, severity: CRIT, category: etcd }
  annotations:
    summary: "CRIT EtcdNoLeader: {{ $labels.cls }} {{ $value }}"
    description: |
      etcd_server_has_leader[cls={{ $labels.cls }}] = {{ $value }} < 1
      http://g.pigsty/d/etcd-overview?from=now-5m&to=now&var-cls={{$labels.cls}}
#==============================================================#
# Saturation #
#==============================================================#
- alert: EtcdQuotaFull
  expr: etcd:cls:quota_usage > 0.90
  for: 1m
  labels: { level: 1, severity: WARN, category: etcd }
  annotations:
    summary: "WARN EtcdQuotaFull: {{ $labels.cls }}"
    description: |
      etcd:cls:quota_usage[cls={{ $labels.cls }}] = {{ $value | printf "%.3f" }} > 90%
#==============================================================#
# Latency #
#==============================================================#
# etcd network peer rt p95 > 200ms for 1m
- alert: EtcdNetworkPeerRTSlow
  expr: etcd:ins:network_peer_rt_p95_5m > 0.200
  for: 1m
  labels: { level: 2, severity: INFO, category: etcd }
  annotations:
    summary: "INFO EtcdNetworkPeerRTSlow: {{ $labels.cls }} {{ $labels.ins }}"
    description: |
      etcd:ins:network_peer_rt_p95_5m[cls={{ $labels.cls }}, ins={{ $labels.ins }}] = {{ $value }} > 200ms
      http://g.pigsty/d/etcd-instance?from=now-10m&to=now&var-cls={{ $labels.cls }}
# Etcd wal fsync rt p95 > 50ms
- alert: EtcdWalFsyncSlow
  expr: etcd:ins:wal_fsync_rt_p95_5m > 0.050
  for: 1m
  labels: { level: 2, severity: INFO, category: etcd }
  annotations:
    summary: "INFO EtcdWalFsyncSlow: {{ $labels.cls }} {{ $labels.ins }}"
    description: |
      etcd:ins:wal_fsync_rt_p95_5m[cls={{ $labels.cls }}, ins={{ $labels.ins }}] = {{ $value }} > 50ms
      http://g.pigsty/d/etcd-instance?from=now-10m&to=now&var-cls={{ $labels.cls }}
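
The alert expressions above reference derived series (etcd:cls:quota_usage, etcd:ins:network_peer_rt_p95_5m, etcd:ins:wal_fsync_rt_p95_5m) whose definitions are not shown on this page. The following is a minimal sketch of how such recording rules could be written from standard etcd metrics. The group name, the aggregation choices, and the expressions themselves are assumptions for illustration, not the definitions shipped in files/prometheus/rules/etcd.yml. The cls and ins labels are taken from the alert rules above.

```yaml
# Illustrative sketch only: these expressions are assumptions,
# not the recording rules Pigsty actually ships.
groups:
  - name: etcd-derived-sketch   # hypothetical group name
    rules:
      # Share of the backend quota in use, taking the fullest member per cluster
      - record: etcd:cls:quota_usage
        expr: max by (cls) (etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes)

      # p95 peer round-trip time over the last 5 minutes, per instance (seconds)
      - record: etcd:ins:network_peer_rt_p95_5m
        expr: histogram_quantile(0.95, sum by (cls, ins, le) (rate(etcd_network_peer_round_trip_time_seconds_bucket[5m])))

      # p95 WAL fsync duration over the last 5 minutes, per instance (seconds)
      - record: etcd:ins:wal_fsync_rt_p95_5m
        expr: histogram_quantile(0.95, sum by (cls, ins, le) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])))
```

Both histograms are recorded in seconds, so the alert thresholds 0.200 and 0.050 correspond to the 200ms and 50ms limits named in the rule descriptions. A Prometheus rule file also needs the groups / rules wrapper shown here; the alert entries listed above would sit under a rules key in the same way.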