maps-test200{2-4} PostgreSQL replication needs rebuilding
Closed, Resolved (Public)

Description

Replication broke again, and I posted a question about it on DBA Stack Exchange. There is an answer explaining how to recover and how to avoid this in the future. Any feedback is welcome, of course.

LOG:  started streaming WAL from primary at 182/0 on timeline 1
FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000018200000000 has already been removed
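
For readers matching the two log lines above: the segment file name encodes the timeline and the starting WAL position. The following is a minimal sketch of that decoding, assuming the default 16MB segment size; the function name `wal_segment_start` is made up for illustration.

```shell
# Decode a WAL segment file name into its timeline and starting LSN.
# Assumes 16MB segments (the PostgreSQL default).
wal_segment_start() {
    name=$1
    tli=$(printf '%s' "$name" | cut -c1-8)    # timeline ID (hex)
    log=$(printf '%s' "$name" | cut -c9-16)   # high 32 bits of the LSN (hex)
    seg=$(printf '%s' "$name" | cut -c17-24)  # segment number within that range (hex)
    printf 'timeline %d, start LSN %X/%X\n' \
        "0x$tli" "0x$log" "$(( 0x$seg * 16 * 1024 * 1024 ))"
}

wal_segment_start 000000010000018200000000
# prints: timeline 1, start LSN 182/0
```

So the segment reported as removed is exactly the one the slave asked to start streaming from.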

Event Timeline

Yurik raised the priority of this task to Needs Triage.
Yurik updated the task description.
Yurik added subscribers: Yurik, akosiaris, jcrespo, MaxSem.

The reason is almost certainly this: https://phabricator.wikimedia.org/P2231

Dropping and recreating the PRIMARY KEY on a 100GB table is not exactly the best thing you can do to a database. As far as tuning the checkpoint_segments parameter goes, the default is 3 and we've already increased it to 64. Each segment is a 16MB WAL file. We could always increase it further, but it is clear to me that this would not have avoided the problem in this specific case; it would just have happened a bit later (and I do mean a bit, that is, a few minutes). That would not have been enough: the operation started at 1:37:31, the problem manifested less than 4 minutes later at 1:41:20, and the operation continued until 2:25 according to this:

https://grafana.wikimedia.org/dashboard/db/server-board?from=1445733188203&to=1445742598391&var-server=maps-test2001&var-network=eth0
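
To put the numbers above in perspective, here is a rough sketch of the arithmetic. It uses the bound given in the PostgreSQL 9.4 documentation for the normal number of WAL files, (2 + checkpoint_completion_target) * checkpoint_segments + 1. The checkpoint_completion_target of 0.5 is the documented default and only an assumption about this cluster; the function names are made up for illustration.

```shell
SEGMENT_MB=16   # size of one WAL segment file, as stated above

# WAL volume written between two checkpoints for a given checkpoint_segments.
wal_mb_between_checkpoints() {
    echo $(( $1 * SEGMENT_MB ))
}

# Normal upper bound on the number of WAL files kept on the master:
# (2 + checkpoint_completion_target) * checkpoint_segments + 1
# The target is passed in tenths (5 means 0.5) to stay in integer arithmetic.
max_wal_files() {
    segments=$1
    target_tenths=$2
    echo $(( (20 + target_tenths) * segments / 10 + 1 ))
}

wal_mb_between_checkpoints 64   # 1024 (MB)
max_wal_files 64 5              # 161 files, i.e. about 2.5GB
```

Under these assumptions the master normally keeps only a few GB of WAL, which a rebuild of an index on a 100GB table can run through within minutes.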

The slaves cannot recover from this gracefully, since the WAL files they need have been deleted. Recovery is going to be a destructive procedure, since the slaves need to be re-initialized from the master. I'll start it shortly.

And can we please avoid these kinds of operations in the future without some prior discussion?

akosiaris triaged this task as Unbreak Now! priority. Oct 26 2015, 12:42 PM
akosiaris closed this task as Resolved. Oct 26 2015, 12:47 PM
akosiaris claimed this task.

After a full reinitialization of the slaves, replication is working once more. I see a few follow-ups:

  • Monitor PostgreSQL replication status (T116580).
  • Increase the checkpoint_segments parameter. The actual number to set is quite hard to figure out, however. I propose we follow a disk space usage approach. Letting WALs occupy ~50GB of space could help avoid similar issues in the future, but it is definitely not a bulletproof solution.
  • Make sure to communicate such changes in the future.
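
As a rough illustration of the disk space approach in the list above, the sketch below turns a WAL disk budget into a segment count. It assumes 16MB segments and the documented default checkpoint_completion_target of 0.5; the function names are hypothetical, and the result is not the value that was eventually deployed.

```shell
SEGMENT_MB=16   # size of one WAL segment file

# How many 16MB segment files fit into a budget given in GB.
wal_files_for_budget() {
    echo $(( $1 * 1024 / SEGMENT_MB ))
}

# Largest checkpoint_segments whose normal upper bound,
# (2 + 0.5) * checkpoint_segments + 1 files, still fits the budget.
checkpoint_segments_for_budget() {
    files=$(wal_files_for_budget "$1")
    echo $(( (files - 1) * 10 / 25 ))
}

wal_files_for_budget 50             # 3200 files
checkpoint_segments_for_budget 50   # 1279
```

The bound is only the normal case: a slave that is down for long enough will still fall behind whatever amount of WAL is retained.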

@akosiaris, sorry for the trouble. Just in case I break it again, could you write down the sequence of steps you took to recover it? This way I won't have to bug anyone next time it happens ))

Also, this answer mentions replication slots - would it make sense to use them?

Oh, it is basically reinitializing the slave.

stop postgres
mv /srv/postgresql/9.4/main/recovery.conf ~/
rm -rf /srv/postgresql/9.4/main
/usr/bin/pg_basebackup -x -D /srv/postgresql/9.4/main -h maps-test2001.codfw.wmnet -U replication -w
mv ~/recovery.conf /srv/postgresql/9.4/main/
start postgres
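
The steps above could be wrapped in a small script along these lines. This is only a sketch: the `service postgresql` commands are an assumption about how PostgreSQL is stopped and started on these hosts, the data directory default is taken from the pg_basebackup command above, and the `DRY_RUN` switch and function names are invented for illustration. By default it only prints what it would do.

```shell
# Sketch: re-initialize a streaming-replication slave from the master.
# This wipes the slave's data directory when DRY_RUN=0.
DATADIR="${DATADIR:-/srv/postgresql/9.4/main}"   # assumed data directory
MASTER="${MASTER:-maps-test2001.codfw.wmnet}"
DRY_RUN="${DRY_RUN:-1}"                          # 1 = only print the commands

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reinit_slave() {
    # Each step runs only if the previous one succeeded.
    run service postgresql stop &&
    run mv "$DATADIR/recovery.conf" "$HOME/recovery.conf" &&
    run rm -rf "$DATADIR" &&
    run /usr/bin/pg_basebackup -x -D "$DATADIR" -h "$MASTER" -U replication -w &&
    run mv "$HOME/recovery.conf" "$DATADIR/recovery.conf" &&
    run service postgresql start
}

reinit_slave
```

Running it with `DRY_RUN=0` executes the commands for real; the data directory must end up owned by the user the server runs as, so the copy would normally be made as that user.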

As far as replication slots go, this is a new feature introduced in PostgreSQL 9.4 for the "new" logical decoding (http://www.postgresql.org/docs/9.4/static/logicaldecoding.html). Logical decoding can be used to enable logical replication, and replication slots were designed to support exactly that use case. That being said, they also work for our case, where we are using the "old" physical replication. Since it's a new feature, we should experiment with it, but for at least a while we should not make replication depend on it.
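
For such an experiment, a physical replication slot in 9.4 needs roughly three pieces: a slot created on the master, a `primary_slot_name` line in the slave's recovery.conf, and a status query. The helpers below only print those pieces; the helper names and the slot name `maps_test2002` are hypothetical, and the master also needs `max_replication_slots` set above zero.

```shell
# SQL to run on the master to create a physical replication slot.
# Slot names may contain only lowercase letters, digits and underscores.
slot_create_sql() {
    printf "SELECT * FROM pg_create_physical_replication_slot('%s');\n" "$1"
}

# Line to add to recovery.conf on the slave that should use the slot.
slot_recovery_conf_line() {
    printf "primary_slot_name = '%s'\n" "$1"
}

# SQL to run on the master to see which slots exist and whether they are in use.
slot_status_sql() {
    echo "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
}

slot_create_sql maps_test2002
slot_recovery_conf_line maps_test2002
slot_status_sql
```

The output of the SQL helpers can be piped to psql on the master. Note the trade-off: a slot makes the master keep WAL until the slave has consumed it, so a slave that stays down can fill the master's disk.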

@akosiaris, thanks, but I suspect we won't be able to do most of these steps due to permissions?

Yup, that's correct.

Change 249096 had a related patch set uploaded (by Alexandros Kosiaris):
maps: Tune replication parameters

https://gerrit.wikimedia.org/r/249096

Change 249096 merged by Alexandros Kosiaris:
maps: Tune replication parameters

https://gerrit.wikimedia.org/r/249096