
Look into the replica sync fails
Closed, Resolved · Public

Description

The current theory is that the primary postgres database which is regularly updated with openstreetmap data is failing to replicate down to the secondary postgres servers. The database cluster is using streaming replication to mirror data from the primary to the read-only replicas, and that replication is getting out of sync. This makes it impossible for the replica to receive any subsequent updates, until a manual, full sync is done.

The effect we see is that some postgres replica nodes within a data center will have out-of-date shapes, while other nodes are current. The lag measured in bytes shows that 3 out of 5 nodes in codfw are more than 4TB behind: https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=16
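
For reference, the byte lag in that dashboard is what the primary reports per standby in pg_stat_replication; a minimal sketch for checking it by hand from psql on the primary (assuming PostgreSQL 10+ column and function names):

-- On the primary: replay lag of each connected standby, in bytes.
SELECT application_name,
       client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;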

  • Ensure that we have an up-to-date diagram of database instances for the kartotherian cluster and how they're related (here).
  • Try to validate our assumptions.
  • Identify what is breaking the sync.
  • Try to find ways to make the setup more robust (T290149).

Background:
https://www.cybertec-postgresql.com/en/streaming-replication-conflicts-in-postgresql/

Event Timeline

I've removed the point about "improve caching" because I think it's too specific. If the issue is caused by deadlock between read queries on the replica and replication writes, then it will still happen with less traffic, just at a lower rate. For example, the solution could be to change the options or index use for geoshapes queries, in which case there's no need to also reduce traffic. Improved caching can be its own maintenance task, though!
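
If we want to test that theory, each standby keeps per-database counters of queries cancelled by recovery conflicts, broken down by cause; a quick sketch (run on a replica, e.g. one of the lagging codfw nodes):

-- On a standby: how many queries were cancelled because they conflicted with replication replay.
SELECT datname,
       confl_lock,
       confl_snapshot,
       confl_bufferpin,
       confl_deadlock
  FROM pg_stat_database_conflicts;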

Change 850160 had a related patch set uploaded (by Awight; author: Awight):

[operations/puppet@production] Invite some of WMDE Tech Wishes team to poke around maps instances

https://gerrit.wikimedia.org/r/850160

Even though we enabled replication slots, @Jgiannelos and I noticed that maps1005 is behind:

[Attached image.png (415 KB): postgres lag, eqiad]

2022-11-17 13:48:55 GMT LOG:  received fast shutdown request
2022-11-17 13:48:55 GMT LOG:  aborting any active transactions
2022-11-17 13:48:55 GMT FATAL:  terminating connection due to administrator command
2022-11-17 13:48:55 GMT FATAL:  terminating walreceiver process due to administrator command
2022-11-17 13:48:55 GMT LOG:  shutting down
2022-11-17 13:48:55 GMT LOG:  database system is shut down
2022-11-17 13:49:14 GMT LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-11-17 13:49:14 GMT LOG:  listening on IPv6 address "::", port 5432
2022-11-17 13:49:14 GMT LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-11-17 13:49:14 GMT LOG:  database system was shut down in recovery at 2022-11-17 13:48:55 GMT
2022-11-17 13:49:14 GMT LOG:  entering standby mode
2022-11-17 13:49:14 GMT LOG:  redo starts at E92/57111668
2022-11-17 13:49:14 GMT LOG:  consistent recovery state reached at E92/57111748
2022-11-17 13:49:14 GMT LOG:  invalid resource manager ID 125 at E92/57111748
2022-11-17 13:49:14 GMT LOG:  database system is ready to accept read only connections
2022-11-17 13:49:14 GMT LOG:  started streaming WAL from primary at E92/57000000 on timeline 1
2022-11-17 13:49:15 GMT LOG:  incomplete startup packet

After doing a manual restart:

2022-11-18 13:23:33 GMT LOG:  listening on IPv6 address "::", port 5432
2022-11-18 13:23:33 GMT LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-11-18 13:23:33 GMT LOG:  database system was shut down in recovery at 2022-11-18 13:23:33 GMT
2022-11-18 13:23:33 GMT LOG:  entering standby mode
2022-11-18 13:23:33 GMT LOG:  redo starts at E92/57111668
2022-11-18 13:23:33 GMT LOG:  consistent recovery state reached at E92/57111748
2022-11-18 13:23:33 GMT LOG:  invalid resource manager ID 125 at E92/57111748
2022-11-18 13:23:33 GMT LOG:  database system is ready to accept read only connections
2022-11-18 13:23:33 GMT LOG:  started streaming WAL from primary at E92/57000000 on timeline 1
2022-11-18 13:23:33 GMT LOG:  incomplete startup packet

Will take a look with @hnowlan and see if we can figure out what went wrong.
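
Since the replication slots are new here, one thing worth watching on the primary is whether each standby's slot stays active and its restart_lsn keeps advancing; a minimal sketch:

-- On the primary: one row per replication slot; restart_lsn should keep moving forward.
SELECT slot_name,
       slot_type,
       active,
       restart_lsn
  FROM pg_replication_slots;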

jijiki triaged this task as Medium priority. Nov 18 2022, 2:02 PM
jijiki added a subscriber: serviceops.

Kicking this out of the Tech Wishes projects, since the stale data on codfw is now resolved, and it looks like a more stable, long-term solution is in place.

It appeared that maps1005 was stuck at a previous LSN, reporting fewer bytes replayed than received.

We ran SELECT pg_switch_wal(), which forced a switch to a new WAL segment, and the problem seems to have gone away.
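
For the record, pg_switch_wal() has to run on the primary; a hedged sketch of how the standby's catch-up can be confirmed afterwards (assuming PostgreSQL 10+ function names):

-- On the primary: close the current WAL segment and switch to a new one.
SELECT pg_switch_wal();

-- On the standby (e.g. maps1005): received vs. replayed position; the gap should shrink to 0.
SELECT pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_gap_bytes;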

@awight mind if we close this task and revisit if we are having replication issues?

awight claimed this task.

@awight mind if we close this task and revisit if we are having replication issues?

Works for us—thanks again!