Common Failures

Common failures and how to diagnose them

Hardware Faults

ID Name Symptom Process
H01 Primary node down pg_up = 0 for 1-3 minutes No immediate intervention required.
Add a replacement instance afterwards
Removal from Access Domain
Execute Case 8: Cluster Role Adjustment
H02 Replica node down pg_up = 0 for 1-3 minutes No immediate intervention required
Add a replacement instance afterwards
Removal from Access Domain
Execute Case 8: Cluster Role Adjustment
H03 Primary node network partition Loss of all monitoring data of the primary instance, network unreachable Confirm Failover status
Force fencing of the old primary if necessary
H04 Replica node network partition Loss of all monitoring data from the instance, network unreachable Usually no effect; wait for recovery
Contact O&M and network engineers to handle
H05 TCP retransmission rate too high TCP retransmissions stay high for a long time, many connection resets, many query requests fail Contact O&M and network engineers to handle
H06 Node memory error EDAC counter growth, errors in the system log After confirming that the replica's memory has no errors, execute Case 10: cluster primary-replica switch
H07 Bad blocks on disk, data corruption Query results and logs show error messages such as can’t read block Execute Case 10: cluster primary-replica switch
Manual data recovery using data recovery tools
R01 High CPU usage CPU / load / pressure index high Use top to check for programs with a large CPU footprint and clean them up
In the case of a query avalanche, kill queries to stop it
R02 OOM Processes fail, OOM messages appear, memory usage is high, SWAP starts being used Check memory and SWAP usage
Use top to check for programs with a large memory footprint and clean them up
Restart the killed processes
Add a SWAP partition as an emergency measure
R03 Disk full Disk is written full
Database Crash
A large number of shell commands cannot be executed
Remove /pg/dummy to free up emergency space
Check and handle WAL buildup
Check for and clean up large numbers of log files
Confirm whether the business has cleanable data
R06 Disk/network card IO too high Disk/NIC bandwidth too high
Disk > 2 GB/s
Network > 1 GB/s
Check applications that use the network or disk heavily, such as backups, and add rate limits
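The R03 and R06 rows reduce to simple threshold checks. The following is a minimal sketch of how such checks might look, not part of any tooling described here. The function names are hypothetical, and reading "GB" as decimal gigabytes is an assumption.

```python
import shutil

# Thresholds from the R06 row. Decimal gigabytes are an assumption.
DISK_BW_LIMIT = 2 * 10**9  # bytes per second
NET_BW_LIMIT = 1 * 10**9   # bytes per second


def io_alerts(disk_bytes_per_s, net_bytes_per_s):
    """Return the R06 conditions triggered by the given throughput samples."""
    alerts = []
    if disk_bytes_per_s > DISK_BW_LIMIT:
        alerts.append("disk bandwidth above 2 GB/s")
    if net_bytes_per_s > NET_BW_LIMIT:
        alerts.append("network bandwidth above 1 GB/s")
    return alerts


def disk_used_ratio(path="/"):
    """Fraction of the filesystem holding `path` that is in use (R03)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    print(io_alerts(2.5e9, 0.5e9))
    print(f"root filesystem used: {disk_used_ratio('/'):.1%}")
```

The throughput samples would have to come from whatever monitoring source is already in place; the sketch only shows the comparison against the documented limits.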

Software Errors

ID Name Symptom Process
SP1 Database process aborted ps aux shows no postgres process Check Postgres and Patroni status
Confirm Failover results, or perform Failover manually
SP2 Connection pool process aborted systemctl status pgbouncer reports failure Restart or reset the service component
SP3 Primary Patroni process aborted systemctl status patroni reports failure As above: enter maintenance mode, then restart or reset Patroni
SP4 Primary Consul process aborted systemctl status consul reports failure As above: enter maintenance mode, then restart or reset Consul
S05 HAProxy process aborted systemctl status haproxy reports failure As above: restart or reset HAProxy
S06 Connection pool contamination An error message similar to Cannot execute XXX on read-only transactions appears Restart the Pgbouncer connection pool
or configure server_reset_query
S07 Connection pool cannot connect to the database pgbouncer cannot connect to server Check whether the user, password, and HBA configuration are correct
Execute Case-4: Cluster Service User Creation to refresh the user
S08 Connection pool reaches QPS bottleneck Pgbouncer QPS reaches 30,000 to 40,000, CPU usage reaches 100% Use multiple Pgbouncers (not recommended)
Use the Default service to bypass Pgbouncer
Ask the business side to apply rate limiting
S09 DCS Server is not available In auto-switchover mode, all primaries become unwritable after the TTL expires Set all clusters to maintenance mode immediately
S10 DCS Agent is not available A replica is not affected; a primary is demoted to a replica and the cluster is not writable Set all clusters to maintenance mode immediately
S11 XID Wraparound Protection mode is entered when about 10 million XIDs remain This problem should be avoided in advance through monitoring
Locate the over-aged databases and tables and perform emergency cleaning
Quickly locate and resolve whatever is blocking vacuum
Restore in single-user mode
S12 WAL Stacking WAL size continues to grow Execute CHECKPOINT multiple times
Confirm the WAL archive status
Confirm whether there are unfinished long-running transactions on the replica
Confirm whether any replication slots are preventing WAL recycling
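For S11 and S12, the checks boil down to two catalog queries and one piece of arithmetic. The sketch below is illustrative only. The SQL uses the standard pg_database and pg_replication_slots catalogs; the helper names are hypothetical; the 2^31 ceiling is an approximation of the XID limit; and the 10 million threshold is the figure from the S11 row.

```python
# Standard catalog queries; run them with any PostgreSQL client.
XID_AGE_SQL = (
    "SELECT datname, age(datfrozenxid) AS xid_age "
    "FROM pg_database ORDER BY xid_age DESC"
)
REPLICATION_SLOTS_SQL = (
    "SELECT slot_name, active, restart_lsn FROM pg_replication_slots"
)

XID_CEILING = 2**31                # approximate wraparound horizon
PROTECTION_THRESHOLD = 10_000_000  # figure taken from the S11 row


def xid_remaining(xid_age):
    """Approximate number of XIDs left before wraparound for a given age."""
    return XID_CEILING - xid_age


def needs_emergency_cleaning(xid_age, threshold=PROTECTION_THRESHOLD):
    """True when the remaining XIDs have fallen below the threshold."""
    return xid_remaining(xid_age) < threshold


if __name__ == "__main__":
    # Example ages as they might be returned by XID_AGE_SQL (made-up values).
    for name, age in [("app", 2_140_000_000), ("postgres", 150_000_000)]:
        print(name, xid_remaining(age), needs_emergency_cleaning(age))
```

Inactive rows returned by REPLICATION_SLOTS_SQL are the usual suspects when WAL cannot be recycled, which is why the query selects the active column alongside restart_lsn.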

Human Errors

ID Name Symptom Process
M01 Mistakenly deleted database clusters The database cluster is gone Use cold standby to recover the cluster
Prepare to run
M02 Mistakenly promoting an instance to primary Split-brain No handling needed in automatic mode; otherwise a split-brain occurs
M03 Data deleted by mistake The data is gone Stop vacuum, extract the data with pg_dirtyread
Extract from delayed cluster
Extract and restore from cold standby
M04 Table dropped by mistake The table is gone Fetch from delayed cluster
Fetch and restore from cold standby
M05 Integer Sequence Number Overflow Sequence exceeds INTMAX Refer to integer primary key online upgrade manual to handle
M06 Insert conflicts due to duplicate primary key sequence values violate constraint … Advance the sequence value (e.g. +100000)
M07 Slow query queuing / avalanche Large number of slow query logs Use pg_terminate_backend to periodically clean up slow queries (e.g. every 1 second)
M08 Deadlock queuing / avalanche Lock stacking Use pg_terminate_backend to periodically clean up queries (e.g. every 1 second)
M09 HBA denied access no HBA entry for xxx Case 6: APPLY-PGSQL-HBA
M10 User password error password auth failure for xxx Case 4: Create PGSQL Biz User
M11 Insufficient access privileges permission denied for x Check whether the object was created by the correct admin user
Refer to Default Privilege to manually fix the object privileges
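The periodic cleanup mentioned for M07 and M08 can be sketched as a small loop around pg_terminate_backend. This is a hedged illustration under stated assumptions, not the procedure used in practice. The function names are hypothetical. The `execute` callable stands in for whatever database client is used. The filter on pg_stat_activity (active state, query age) is one possible definition of a slow query.

```python
import time


def slow_query_kill_sql(max_seconds):
    """Build a statement that terminates active queries older than max_seconds."""
    if not isinstance(max_seconds, int) or max_seconds <= 0:
        raise ValueError("max_seconds must be a positive integer")
    return (
        "SELECT pg_terminate_backend(pid) FROM pg_stat_activity "
        "WHERE state = 'active' AND pid <> pg_backend_pid() "
        f"AND now() - query_start > interval '{max_seconds} seconds'"
    )


def kill_loop(execute, max_seconds, rounds, pause_s=1.0):
    """Run the cleanup statement `rounds` times, pausing between runs.

    `execute` is any callable that sends one SQL string to the database.
    """
    sql = slow_query_kill_sql(max_seconds)
    for _ in range(rounds):
        execute(sql)
        time.sleep(pause_s)


if __name__ == "__main__":
    # Dry run: print the statement instead of sending it to a server.
    kill_loop(print, max_seconds=30, rounds=2, pause_s=0)
```

Passing the executor in as a parameter keeps the sketch independent of any particular driver and makes a dry run possible before pointing it at a production cluster.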

Last modified 2022-06-04: fill en docs (5a858d3)