PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 745563800 and 160 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 489235888 and 170 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 144477792 and 292 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 286726232 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 308647632 and 328 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1628094496 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
Description
Related Objects
- Mentioned Here
- T268927: Some PostgreSQL replicas are not fully updated
Event Timeline
Mentioned in SAL (#wikimedia-operations) [2020-12-04T01:04:04Z] <ryankemper> T269406 https://grafana.wikimedia.org/d/000000305/maps-performances?viewPanel=11&orgId=1&var-cluster=maps1&from=1606827063027&to=1607043666975 shows that the normal daily dropoff in lag did not occur today, leading to the criticals. It's possible some sort of daily job has failed
Not sure what the problem was, but it recovered on its own: https://grafana.wikimedia.org/d/000000305/maps-performances?viewPanel=11&orgId=1&var-cluster=maps1&from=1606950068010&to=1607044444481
@RKemper, could T268927 (Some PostgreSQL replicas are not fully updated) be related to this ticket?
Also, the OSM sync lag is not the same as Postgres replication lag. OSM sync lag measures how far maps1004 (the main DB) is behind upstream OSM data, although the two may coincide in the monitoring data.
This issue is caused by the increased number of replicas in both datacentres. When the OSM sync happens, the large import makes the replication lag jump. I'm currently tweaking the alerting thresholds to try to stop these alerts from spamming.
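
To get a feel for how a threshold change would affect the alerts quoted at the top of this task, here is a rough Python sketch. It assumes the two numbers in each alert are the lag in bytes and in seconds, and that an alert fires only when both limits are exceeded. The limits used here (HYPOTHETICAL_MAX_BYTES, HYPOTHETICAL_MAX_SECONDS) and the helper names are made up for illustration; they are not the values or code actually deployed.

```python
import re

# Alert lines copied from this task (trailing wikitech URL trimmed).
ALERTS = [
    "PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 745563800 and 160 seconds",
    "PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 489235888 and 170 seconds",
    "PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 144477792 and 292 seconds",
    "PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 286726232 and 312 seconds",
    "PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 308647632 and 328 seconds",
    "PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1628094496 and 128 seconds",
]

ALERT_RE = re.compile(
    r"Postgres Replication Lag on (?P<host>\S+) is CRITICAL: .*?"
    r"\(host:localhost\) (?P<bytes>\d+) and (?P<seconds>\d+) seconds"
)

# Hypothetical limits, chosen only for this example.
HYPOTHETICAL_MAX_BYTES = 100_000_000
HYPOTHETICAL_MAX_SECONDS = 300


def parse_alert(line):
    """Return (host, lag_bytes, lag_seconds), or None if the line does not match."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    return m["host"], int(m["bytes"]), int(m["seconds"])


def still_critical(lag_bytes, lag_seconds,
                   max_bytes=HYPOTHETICAL_MAX_BYTES,
                   max_seconds=HYPOTHETICAL_MAX_SECONDS):
    """Assumed rule: alert only when BOTH the byte and the time limit are exceeded."""
    return lag_bytes > max_bytes and lag_seconds > max_seconds


def hosts_still_alerting(lines):
    """List hosts whose parsed lag would still be critical under the limits above."""
    hosts = []
    for line in lines:
        parsed = parse_alert(line)
        if parsed is not None and still_critical(parsed[1], parsed[2]):
            hosts.append(parsed[0])
    return hosts


if __name__ == "__main__":
    print(hosts_still_alerting(ALERTS))
```

With these made-up limits only the replicas that were more than 300 seconds behind would still alert, so the short spikes during the OSM import would stay quiet while a replica that stays behind would still be caught.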