
Look into the replica sync fails
Closed, Resolved · Public

Description

The current theory is that the primary postgres database which is regularly updated with openstreetmap data is failing to replicate down to the secondary postgres servers. The database cluster is using streaming replication to mirror data from the primary to the read-only replicas, and that replication is getting out of sync. This makes it impossible for the replica to receive any subsequent updates, until a manual, full sync is done.

The effect we see is that some postgres replica nodes within a data center will have out-of-date shapes, while other nodes are current. The lag measured in bytes shows that 3 out of 5 nodes in codfw are more than 4TB behind: https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=16
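
For reference, the byte lag in that dashboard is what the primary reports per standby in pg_stat_replication; a minimal sketch for checking it by hand from psql on the primary (assuming PostgreSQL 10+ column and function names):

-- On the primary: replay lag of each connected standby, in bytes.
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;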

  • Ensure that we have an up-to-date diagram of database instances for the kartotherian cluster and how they're related (here).
  • Try to validate our assumptions.
  • Identify what is breaking the sync.
  • Try to find ways to make the setup more robust (T290149).

Background:
https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/

Event Timeline

I've removed the point about "improve caching" because I think it's too specific. If the issue is caused by deadlock between read queries on the replica and replication writes, then it will still happen with less traffic, just at a lower rate. For example, the solution could be to change the options or index use for geoshapes queries, in which case there's no need to also reduce traffic. Improved caching can be its own maintenance task, though!
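
If we want to test that theory, each standby keeps per-database counters of queries cancelled by recovery conflicts, broken down by cause; a quick sketch (run on a replica, e.g. one of the lagging codfw nodes):

-- On a standby: how many queries were cancelled because they conflicted with replication replay.
SELECT datname,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
  FROM pg_stat_database_conflicts;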

Change 850160 had a related patch set uploaded (by Awight; author: Awight):

[operations/puppet@production] Invite some of WMDE Tech Wishes team to poke around maps instances

https://gerrit.wikimedia.org/r/850160

Even though we enabled replication slots, @Jgiannelos and I noticed that maps1005 is behind:

[Attached image.png (415 KB): postgres lag, eqiad]

2022-11-17 13:48:55 GMT LOG:  received fast shutdown request
2022-11-17 13:48:55 GMT LOG:  aborting any active transactions
2022-11-17 13:48:55 GMT FATAL:  terminating connection due to administrator command
2022-11-17 13:48:55 GMT FATAL:  terminating walreceiver process due to administrator command
2022-11-17 13:48:55 GMT LOG:  shutting down
2022-11-17 13:48:55 GMT LOG:  database system is shut down
2022-11-17 13:49:14 GMT LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-11-17 13:49:14 GMT LOG:  listening on IPv6 address "::", port 5432
2022-11-17 13:49:14 GMT LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-11-17 13:49:14 GMT LOG:  database system was shut down in recovery at 2022-11-17 13:48:55 GMT
2022-11-17 13:49:14 GMT LOG:  entering standby mode
2022-11-17 13:49:14 GMT LOG:  redo starts at E92/57111668
2022-11-17 13:49:14 GMT LOG:  consistent recovery state reached at E92/57111748
2022-11-17 13:49:14 GMT LOG:  invalid resource manager ID 125 at E92/57111748
2022-11-17 13:49:14 GMT LOG:  database system is ready to accept read only connections
2022-11-17 13:49:14 GMT LOG:  started streaming WAL from primary at E92/57000000 on timeline 1
2022-11-17 13:49:15 GMT LOG:  incomplete startup packet

After doing a manual restart:

2022-11-18 13:23:33 GMT LOG:  listening on IPv6 address "::", port 5432
2022-11-18 13:23:33 GMT LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-11-18 13:23:33 GMT LOG:  database system was shut down in recovery at 2022-11-18 13:23:33 GMT
2022-11-18 13:23:33 GMT LOG:  entering standby mode
2022-11-18 13:23:33 GMT LOG:  redo starts at E92/57111668
2022-11-18 13:23:33 GMT LOG:  consistent recovery state reached at E92/57111748
2022-11-18 13:23:33 GMT LOG:  invalid resource manager ID 125 at E92/57111748
2022-11-18 13:23:33 GMT LOG:  database system is ready to accept read only connections
2022-11-18 13:23:33 GMT LOG:  started streaming WAL from primary at E92/57000000 on timeline 1
2022-11-18 13:23:33 GMT LOG:  incomplete startup packet

Will take a look with @hnowlan and see if we can figure out what went wrong.
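
Since the replication slots are new here, one thing worth watching on the primary is whether each standby's slot stays active and its restart_lsn keeps advancing; a minimal sketch:

-- On the primary: one row per replication slot; restart_lsn should keep moving forward.
SELECT slot_name,
       slot_type,
       active,
       restart_lsn
  FROM pg_replication_slots;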

jijiki triaged this task as Medium priority. Nov 18 2022, 2:02 PM
jijiki added a subscriber: serviceops.

Kicking this out of the Tech Wishes projects, since the stale data on codfw is now resolved, and it looks like a more stable, long-term solution is in place.

It appeared that maps1005 was stuck at a previous LSN, reporting fewer bytes replayed than received.

We ran SELECT pg_switch_wal(), which forced a switch to a new WAL segment, and the problem seems to have gone away.
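
For the record, pg_switch_wal() has to run on the primary; a hedged sketch of how the standby's catch-up can be confirmed afterwards (assuming PostgreSQL 10+ function names):

-- On the primary: close the current WAL segment and switch to a new one.
SELECT pg_switch_wal();

-- On the standby (e.g. maps1005): received vs. replayed position; the gap should shrink to 0.
SELECT pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_gap_bytes;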

@awight mind if we close this task and revisit if we are having replication issues?

awight claimed this task.

@awight mind if we close this task and revisit if we are having replication issues?

Works for us—thanks again!