
MAPS osm replication lag critical in eqiad
Closed, ResolvedPublic

Description

PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 745563800 and 160 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 489235888 and 170 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 144477792 and 292 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 286726232 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 308647632 and 328 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1628094496 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
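Each alert line pairs a byte lag and a time lag, in the form these `POSTGRES_HOT_STANDBY_DELAY` messages use. A minimal Python sketch (the message format is inferred from the alerts above, not from any check documentation) that parses one line into more readable units:

```python
import re

def parse_hot_standby_delay(msg):
    """Extract the byte and time lag from a POSTGRES_HOT_STANDBY_DELAY
    message of the form seen above: '<bytes> and <seconds> seconds'."""
    m = re.search(r"(\d+) and (\d+) seconds", msg)
    if not m:
        raise ValueError(f"unrecognised message: {msg!r}")
    return int(m.group(1)), int(m.group(2))

msg = ("POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 "
       "(host:localhost) 745563800 and 160 seconds")
lag_bytes, lag_seconds = parse_hot_standby_delay(msg)
# 745563800 bytes is roughly 711 MiB of unreplayed WAL.
print(f"{lag_bytes / 2**20:.0f} MiB behind, last replay {lag_seconds}s ago")
```

So the maps1005 alert above corresponds to roughly 711 MiB of WAL not yet replayed, with the last replayed transaction 160 seconds old.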

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-12-04T01:04:04Z] <ryankemper> T269406 https://grafana.wikimedia.org/d/000000305/maps-performances?viewPanel=11&orgId=1&var-cluster=maps1&from=1606827063027&to=1607043666975 shows that the normal daily dropoff in lag did not occur today, leading to the criticals. It's possible some sort of daily job has failed

@RKemper could T268927: Some PostgreSQL replicas are not fully updated be related to this ticket?

Also, the OSM sync lag is not the same as the Postgres replication lag. OSM sync lag measures how far maps1004 (the main DB) is behind the upstream OSM data, although the two metrics may happen to coincide.
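To make the distinction concrete, here is a hedged sketch of how each kind of lag could be inspected; the state-file path is an assumption for illustration, not taken from this ticket:

```shell
# Postgres replication lag, measured on a standby (e.g. maps1005):
# seconds since the last WAL transaction was replayed from the primary.
psql -d template1 -Atc \
  "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());"

# OSM sync lag, measured on the primary (maps1004): age of the last
# imported OSM diff. Osmosis-style replication setups record the
# timestamp of the last applied changeset in a state file; the path
# below is hypothetical.
grep timestamp /srv/osmosis/state.txt
```

The first number can spike on every replica whenever a large write lands on the primary, while the second only moves when the OSM import pipeline itself falls behind.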

This issue is caused by the increased number of replicas in both datacentres. When the OSM sync happens, replication lag jumps because of the large import. I'm currently tweaking the alerting thresholds to stop these alerts from spamming.
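As one possible shape for that tweak, check_postgres's hot_standby_delay action (which the alert text above appears to come from) accepts a byte value and a time value joined by 'and' as its thresholds. The exact flags and values below are assumptions for illustration, not the production configuration:

```shell
# Sketch of loosened thresholds so the expected post-import lag spike
# no longer trips CRITICAL. Values are hypothetical, not the ones
# actually deployed on the maps cluster.
check_postgres.pl --action=hot_standby_delay \
  --host=localhost --dbname=template1 \
  --warning='2147483648 and 10 min' \
  --critical='4294967296 and 30 min'
```

Raising both the byte and the time component means a brief, large WAL burst from the OSM import has to persist for a while before an alert fires.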