
MAPS osm replication lag critical in eqiad
Closed, ResolvedPublic

Description

PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 745563800 and 160 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 489235888 and 170 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 144477792 and 292 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 286726232 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 308647632 and 328 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1628094496 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
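Each alert line pairs a byte lag and a time lag, in the form these `POSTGRES_HOT_STANDBY_DELAY` messages use. A minimal Python sketch (the message format is inferred from the alerts above, not from any check documentation) that parses one line into more readable units:

```python
import re

def parse_hot_standby_delay(msg):
    """Extract the byte and time lag from a POSTGRES_HOT_STANDBY_DELAY
    message of the form seen above: '<bytes> and <seconds> seconds'."""
    m = re.search(r"(\d+) and (\d+) seconds", msg)
    if not m:
        raise ValueError(f"unrecognised message: {msg!r}")
    return int(m.group(1)), int(m.group(2))

msg = ("POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 "
       "(host:localhost) 745563800 and 160 seconds")
lag_bytes, lag_seconds = parse_hot_standby_delay(msg)
# 745563800 bytes is roughly 711 MiB of unreplayed WAL.
print(f"{lag_bytes / 2**20:.0f} MiB behind, last replay {lag_seconds}s ago")
```

So the maps1005 alert above corresponds to roughly 711 MiB of WAL not yet replayed, with the last replayed transaction 160 seconds old.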

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2020-12-04T01:04:04Z] <ryankemper> T269406 https://grafana.wikimedia.org/d/000000305/maps-performances?viewPanel=11&orgId=1&var-cluster=maps1&from=1606827063027&to=1607043666975 shows that the normal daily dropoff in lag did not occur today, leading to the criticals. It's possible some sort of daily job has failed

@RKemper could T268927: Some PostgreSQL replicas are not fully updated be related to this ticket?

Also, the OSM sync lag is not the same as the Postgres replication lag. OSM sync lag measures how far maps1004 (the main DB) is behind the upstream OSM data, although the two metrics may happen to coincide.
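To make the distinction concrete, here is a hedged sketch of how each kind of lag could be inspected; the state-file path is an assumption for illustration, not taken from this ticket:

```shell
# Postgres replication lag, measured on a standby (e.g. maps1005):
# seconds since the last WAL transaction was replayed from the primary.
psql -d template1 -Atc \
  "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp());"

# OSM sync lag, measured on the primary (maps1004): age of the last
# imported OSM diff. Osmosis-style replication setups record the
# timestamp of the last applied changeset in a state file; the path
# below is hypothetical.
grep timestamp /srv/osmosis/state.txt
```

The first number can spike on every replica whenever a large write lands on the primary, while the second only moves when the OSM import pipeline itself falls behind.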

This issue is caused by the increased number of replicas in both datacentres. When the OSM sync happens, replication lag jumps because of the large import. I'm currently tweaking the alerting thresholds to stop these alerts from spamming.
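As one possible shape for that tweak, check_postgres's hot_standby_delay action (which the alert text above appears to come from) accepts a byte value and a time value joined by 'and' as its thresholds. The exact flags and values below are assumptions for illustration, not the production configuration:

```shell
# Sketch of loosened thresholds so the expected post-import lag spike
# no longer trips CRITICAL. Values are hypothetical, not the ones
# actually deployed on the maps cluster.
check_postgres.pl --action=hot_standby_delay \
  --host=localhost --dbname=template1 \
  --warning='2147483648 and 10 min' \
  --critical='4294967296 and 30 min'
```

Raising both the byte and the time component means a brief, large WAL burst from the OSM import has to persist for a while before an alert fires.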