Monitoring

Etcd monitoring metrics, dashboards, and alerting rules

Dashboard

The ETCD module provides a monitoring dashboard: Etcd Overview.

ETCD Overview Dashboard

ETCD Overview: Overview of the ETCD cluster

This dashboard provides key information about the ETCD status, with the most notable being ETCD Aliveness, which displays the overall service status of the ETCD cluster.

Red bands indicate periods when instances are unavailable, while the blue-gray bands below show when the entire cluster is unavailable.

etcd-overview.jpg


Alert Rules

Pigsty provides the following five preset alert rules for Etcd, defined in files/prometheus/rules/etcd.yml:

  • EtcdServerDown: Etcd node down, critical alert
  • EtcdNoLeader: Etcd cluster has no leader, critical alert
  • EtcdQuotaFull: Etcd quota usage exceeds 90%, warning
  • EtcdNetworkPeerRTSlow: Etcd network latency is slow, notice
  • EtcdWalFsyncSlow: Etcd disk fsync is slow, notice
#==============================================================#
#                         Aliveness                            #
#==============================================================#
# etcd server instance down
- alert: EtcdServerDown
  expr: etcd_up < 1
  for: 1m
  labels: { level: 0, severity: CRIT, category: etcd }
  annotations:
    summary: "CRIT EtcdServerDown {{ $labels.ins }}@{{ $labels.instance }}"
    description: |
      etcd_up[ins={{ $labels.ins }}, instance={{ $labels.instance }}] = {{ $value }} < 1
      http://g.pigsty/d/etcd-overview      

#==============================================================#
#                         Error                                #
#==============================================================#
# Etcd no Leader triggers a P0 alert immediately
# if dcs_failsafe mode is not enabled, this may lead to global outage
- alert: EtcdNoLeader
  expr: min(etcd_server_has_leader) by (cls) < 1
  for: 15s
  labels: { level: 0, severity: CRIT, category: etcd }
  annotations:
    summary: "CRIT EtcdNoLeader: {{ $labels.cls }} {{ $value }}"
    description: |
      etcd_server_has_leader[cls={{ $labels.cls }}] = {{ $value }} < 1
      http://g.pigsty/d/etcd-overview?from=now-5m&to=now&var-cls={{$labels.cls}}      

#==============================================================#
#                        Saturation                            #
#==============================================================#
- alert: EtcdQuotaFull
  expr: etcd:cls:quota_usage > 0.90
  for: 1m
  labels: { level: 1, severity: WARN, category: etcd }
  annotations:
    summary: "WARN EtcdQuotaFull: {{ $labels.cls }}"
    description: |
      etcd:cls:quota_usage[cls={{ $labels.cls }}] = {{ $value | printf "%.3f" }} > 90%      

#==============================================================#
#                         Latency                              #
#==============================================================#
# etcd network peer rt p95 > 200ms for 1m
- alert: EtcdNetworkPeerRTSlow
  expr: etcd:ins:network_peer_rt_p95_5m > 0.200
  for: 1m
  labels: { level: 2, severity: INFO, category: etcd }
  annotations:
    summary: "INFO EtcdNetworkPeerRTSlow: {{ $labels.cls }} {{ $labels.ins }}"
    description: |
      etcd:ins:network_peer_rt_p95_5m[cls={{ $labels.cls }}, ins={{ $labels.ins }}] = {{ $value }} > 200ms
      http://g.pigsty/d/etcd-instance?from=now-10m&to=now&var-cls={{ $labels.cls }}      

# Etcd wal fsync rt p95 > 50ms
- alert: EtcdWalFsyncSlow
  expr: etcd:ins:wal_fsync_rt_p95_5m > 0.050
  for: 1m
  labels: { level: 2, severity: INFO, category: etcd }
  annotations:
    summary: "INFO EtcdWalFsyncSlow: {{ $labels.cls }} {{ $labels.ins }}"
    description: |
      etcd:ins:wal_fsync_rt_p95_5m[cls={{ $labels.cls }}, ins={{ $labels.ins }}] = {{ $value }} > 50ms
      http://g.pigsty/d/etcd-instance?from=now-10m&to=now&var-cls={{ $labels.cls }}      

Last modified 2024-11-22: update minio/etcd docs (345b8442)