
Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration
Closed, Resolved · Public

Description

Per T187962, both servers in this cluster need to move. This task is to manage any failover and coordination needed for this move.
Proposed dates are currently July 10th for labsdb1006 (slave) and July 11th for labsdb1007 (current master).

Event Timeline

The first thing to do on this one is to find the needed subscribers for these servers and fix what looks like broken replication on labsdb1006. It is stuck on a removed WAL file, so depending on the configuration it might need a manual resync in backup mode or some fiddling with streaming replication.
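For reference, a minimal sketch of the kind of checks this involves, assuming a standard PostgreSQL streaming-replication setup; the master hostname, version and paths below are placeholders, not taken from these servers:

    # On the master: is the standby connected at all?
    sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"

    # On the standby: look for the tell-tale error in the postgres log, e.g.
    # "requested WAL segment ... has already been removed"
    sudo grep -i "already been removed" /var/log/postgresql/postgresql-9.4-main.log

    # If the needed WAL segments really are gone from the master, the usual fix
    # is a full resync of the standby with pg_basebackup (wipes the old data dir):
    sudo -u postgres pg_basebackup -h MASTER_HOST -D /path/to/data/dir -X stream -P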

@chasemp Do you know some of the right tags/maintainers to add on this one?

I can probably indeed help. I presently have no idea what's up with labsdb1006, but labsdb1007 is behind the alias osmdb.eqiad.wmnet, which is used directly by the maps labs project. I've added @Kolossos and @dschwen as two of the people I know use this infrastructure. As far as I know, a lot of map tiles are pregenerated, and at least some functionality will not be directly impacted by a downtime of the service. I am willing to bet there will be functionality that is impacted, however, but @Kolossos and @dschwen can probably shed more light on this.

I can arrange to be around for the move to make sure everything is fine.

Please note I never ended up setting up labsdb1006, because puppet support broke on a server upgrade (T157359#3773751).

I know this is out of scope, but I wonder if we should not try to do an upgrade to stretch at the same time. I know this is out of scope, but I hate doing postgres maintenance, so I would prefer to do it only once. Because this is out of scope, I am only asking you to consider it, not proposing it strongly (I don't even know which version comes with stretch).


It's probably a good idea. We will have to upgrade them anyway, and bundling in a reimage and data repopulation of labsdb1006 (which should be easy enough), then making it the master and switching users to it, will probably achieve a much lower downtime overall and allow us to move/reimage labsdb1007 at our leisure.

@Bstorm I can claim this, but let me ask first: do we have a timeline for this?

A timeline for upgrading to stretch or this move event? The basic datacenter reconfig is just scheduled to happen on the dates in the description. To make 1006 the master, we should be mindful that replication needs to be fully set up (it is non-functional at the moment).

A timeline for upgrading to stretch or this move event?

Both, actually. OK, I see July 10th for labsdb1006. I think that's a bit too late if we want to do the stretch upgrade as well. Could we bump this up by, say, 5 days, to July 5th? That would give us the time required for the aforementioned plan to work. Mind you, there is nothing labsdb1006 does right now, so we can just shut it down at any point in time.

To make 1006 the master, we should be mindful that replication needs to be fully set up (it is non-functional at the moment).

No, that's not needed in this case, because it's a peculiar one. The servers are a mirror of openstreetmap, so we can just resync fully from upstream (plus some minor pg_dump+import right before the switch). But

Bump up the switch move or the stretch upgrade? The switch move is scheduled for the 10th and 11th to avoid conflicts with holiday plans and so forth, as well as to coincide with two other database clusters moving on the same days.

No, that's not needed in this case, because it's a peculiar one. The servers are a mirror of openstreetmap, so we can just resync fully from upstream (plus some minor pg_dump+import right before the switch). But

That's good to hear :)

Vvjjkkii renamed this task from Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration to 9zaaaaaaaa. Jul 1 2018, 1:04 AM
Vvjjkkii removed Bstorm as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from 9zaaaaaaaa to Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration. Jul 2 2018, 6:42 AM
CommunityTechBot assigned this task to Bstorm.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Bump up the switch move or the stretch upgrade? The switch move is scheduled for the 10th and 11th to avoid conflicts with holiday plans and so forth, as well as to coincide with two other database clusters moving on the same days.

Sorry, I lost this update in the vandalism spam storm. I meant the stretch upgrade. From what I gather, that can happen at any point in time. Then on the 10th, after labsdb1006 has been successfully moved, we can update the DNS records for osmdb.eqiad.wmnet, making it easy to do whatever kind of maintenance we want on labsdb1007.

Change 443799 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] labsdb1006: Reimage as stretch and make it osm::master

https://gerrit.wikimedia.org/r/443799

Change 443799 merged by Bstorm:
[operations/puppet@production] labsdb1006: Reimage as stretch and make it osm::master

https://gerrit.wikimedia.org/r/443799

Change 444771 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] openstreetmap: add debian stretch to puppet role

https://gerrit.wikimedia.org/r/444771

Change 444771 merged by Bstorm:
[operations/puppet@production] openstreetmap: add debian stretch to puppet role

https://gerrit.wikimedia.org/r/444771

Re-imaged labsdb1006 to stretch. In the process, I found that the storage is a bit odd. One of the LVs is named "_placeholder", which prevents puppet from working, and it isn't mounted. This could be by design. I renamed _placeholder to the correct volume name, matching the current master, and apparently had to create a filesystem on it. Puppet created the directory tree there once I mounted it, and I think the cron job that syncs files over from OSM should run in a few minutes (checking /var/spool). If that finishes by morning, it should actually be ready then. This was a bit heavier than I expected, but it might work out.
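Roughly what that storage fix looked like, as a sketch; the VG/LV names and mount point below are illustrative, not the actual ones on labsdb1006:

    # Rename the placeholder LV to the name the puppet role expects
    sudo lvrename /dev/VGNAME/_placeholder /dev/VGNAME/osm-data

    # The renamed LV had no filesystem yet, so create one and mount it
    sudo mkfs.ext4 /dev/VGNAME/osm-data
    sudo mount /dev/VGNAME/osm-data /srv

    # After that, puppet can create the directory tree under the mount point
    sudo puppet agent -t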

Well, that was silly of me. Of course there are a bunch of roles and things not created on the re-imaged server. It'll probably need some kind of dump and restore to make this easy unless there's a doc around.
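One common way to carry the cluster-wide bits (postgres roles, ownership, grants) across a reimage is to dump just the globals from the old master and replay them on the new host; a sketch only, with no claim that this is exactly what was done here:

    # On the old master (labsdb1007): dump roles and other cluster-wide globals
    sudo -u postgres pg_dumpall --globals-only > globals.sql

    # On the reimaged host (labsdb1006): replay them
    sudo -u postgres psql -f globals.sql postgres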

labsdb1006 is now also moved. I've asked @akosiaris for help getting it set up correctly as a master.

Do we need to do a DNS update now? I was trying to fix the labsdb1006 puppet run, but I found @akosiaris working on it at the moment and didn't want to modify anything without your permission. Is the OSM import running?


Yes, the OSM import is currently running. Coastlines and land polygons have been imported already.


An update: it's calculating ways now, 384071k out of 508857k, at a rate of 6.31k/s. After that it's the relations as well.
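The "ways"/"relations" counters are the usual progress output of an osm2pgsql planet import; purely as an illustration of what such a run looks like (flags, cache size and paths are placeholders, not taken from the puppet role):

    # Illustrative full planet import into a "gis" database
    osm2pgsql --create --slim --database gis \
        --cache 20000 --number-processes 4 \
        /srv/osm/planet-latest.osm.pbf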

Where are we on failing over to this now? I'd like to get on with re-imaging labsdb1007.

Change 448393 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] osm: Populate the osmupdater user

https://gerrit.wikimedia.org/r/448393

Change 448393 merged by Alexandros Kosiaris:
[operations/puppet@production] osm: Populate the osmupdater user

https://gerrit.wikimedia.org/r/448393

akosiaris triaged this task as Medium priority. Jul 27 2018, 2:15 PM

Where are we on failing over to this now? I'd like to get on with re-imaging labsdb1007.

I've just finished the last parts, i.e. making sure the periodic openstreetmap planet sync works, and dumping the various user databases and tables from labsdb1007 and importing them on labsdb1006. I think we are ready to fail over; shall we schedule it for Monday the 30th?
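A sketch of the dump-and-import part, with a placeholder database name; the real list of user databases and any extra flags may well differ:

    # On labsdb1007: dump a user database together with its CREATE DATABASE statement
    sudo -u postgres pg_dump -C -f some_user_db.sql some_user_db

    # On labsdb1006: replay it; -C in the dump recreates and reconnects to the database
    sudo -u postgres psql -f some_user_db.sql postgres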

Failover is accomplished via DNS? osmdb, right?

Or is there something hidden in here that I haven't seen?

There's also OSM replication state (/srv/osmosis/state.txt) that needs to be moved to the new server.

That should already be present, since both have the osm::master role. I can change the DNS on Monday morning (PST) if that's ok.

Failover is accomplished via DNS? osmdb, right?

Exactly.

Or is there something hidden in here that I haven't seen?

No, I don't think so.

There's also OSM replication state (/srv/osmosis/state.txt) that needs to be moved to the new server.

It's there already.
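If anyone wants to double-check that, a quick comparison works (the file carries the osmosis sequenceNumber and timestamp, which should match or be newer on the new master):

    # Compare the osmosis replication state on the two hosts
    ssh labsdb1006.eqiad.wmnet cat /srv/osmosis/state.txt
    ssh labsdb1007.eqiad.wmnet cat /srv/osmosis/state.txt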

Change 449220 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] osmdb: failing over to labsdb1006

https://gerrit.wikimedia.org/r/449220

Change 449220 merged by Bstorm:
[operations/dns@master] osmdb: failing over to labsdb1006

https://gerrit.wikimedia.org/r/449220

It's up. I'd expect some quirks while DNS propagates, but we should be good to re-image labsdb1007 by tomorrow.
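A quick way to watch the propagation (just a resolution check, nothing more):

    # osmdb should now resolve to labsdb1006's address
    dig +short osmdb.eqiad.wmnet
    dig +short labsdb1006.eqiad.wmnet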

If everything looks ok to everyone else, I could demote 1007 to a slave in the roles and start re-imaging.


+1

Change 450251 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] osmdb: Change labsdb1007 back to slave and reimage as stretch

https://gerrit.wikimedia.org/r/450251

Change 450251 merged by Bstorm:
[operations/puppet@production] osmdb: Change labsdb1007 back to slave and reimage as stretch

https://gerrit.wikimedia.org/r/450251

So labsdb1007 is back up and configured as far as labsdb1006 was after its reimage. @akosiaris, do you have the remaining steps?
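For what it's worth, my rough understanding of the remaining step, assuming labsdb1007 is to be rebuilt as a streaming standby of the new master the way the old slave was; the version (9.6 on stretch), data directory path and replication credentials below are assumptions:

    # On labsdb1007: wipe the local cluster and take a base backup from the new master
    sudo service postgresql stop
    sudo -u postgres rm -rf /srv/postgresql/9.6/main
    sudo -u postgres pg_basebackup -h labsdb1006.eqiad.wmnet \
        -D /srv/postgresql/9.6/main -X stream -R -P
    # -R writes a recovery.conf so the instance starts as a streaming standby
    sudo service postgresql start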

Change 454249 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] osm: Slave is labsdb1007, not labsdb1006

https://gerrit.wikimedia.org/r/454249

Change 454249 merged by Alexandros Kosiaris:
[operations/puppet@production] osm: Slave is labsdb1007, not labsdb1006

https://gerrit.wikimedia.org/r/454249

Change 454255 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] osm: The master is osmdb.eqiad.wmnet, not labsdb1007

https://gerrit.wikimedia.org/r/454255

Change 454255 merged by Alexandros Kosiaris:
[operations/puppet@production] osm: The master is osmdb.eqiad.wmnet, not labsdb1007

https://gerrit.wikimedia.org/r/454255

It looks like this is done. We dragged this ticket on for the re-image anyway, so I'll close it. We can always open more.