Plan and manage the required downtime for the clouddb-services VMs on cloudvirt1019 and cloudvirt1020
Closed, Resolved · Public

Description

For the most part, maintenance on the hypervisor servers requires failing over the databases on these VMs, along with the DNS that points to them, to reduce the impact on Toolforge and CloudVPS users. However, there is communication to be done and there are specific steps to take. Those will be documented here.

ToolsDB -- clouddb1001/2
This has unreplicated tables, which require some special effort and additional downtime for the affected tools. Two of the unreplicated databases are only temporarily so (see T257274 and T257275), and this change will clear that up. It is worth reviewing the others to confirm they are still active and still need to remain unreplicated, in case we can remove more of these barriers for the future.
When upgrading clouddb1002's hypervisor (cloudvirt1020), we will need to stop replication on clouddb1002 and stop mariadb before the downtime, but users won't be impacted.
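As a rough illustration of the order of operations on the replica, here is a minimal Python sketch that only builds the command list and does not execute anything. It assumes replication is stopped with MariaDB's `STOP SLAVE` statement and that the service runs under a systemd unit named `mariadb`. Neither detail is confirmed by this task, and the function name is hypothetical.

```python
def replica_shutdown_steps(service_unit="mariadb"):
    """Return ordered (kind, command) pairs for quiescing a replica.

    Assumptions (not from the task): replication is stopped with
    STOP SLAVE, and the daemon is managed by a systemd unit whose
    name is passed in as service_unit.
    """
    return [
        # Stop replication first so the replica is at a clean position.
        ("sql", "STOP SLAVE;"),
        # Then stop the database before the hypervisor goes down.
        ("shell", f"systemctl stop {service_unit}"),
    ]


if __name__ == "__main__":
    for kind, command in replica_shutdown_steps():
        print(f"[{kind}] {command}")
```

Keeping the steps as data makes the ordering easy to review before anyone runs them by hand.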

OSMDB -- clouddb1003/4
Instead of failover, let's just upgrade this with the DB down. I don't think there is as strong an expectation of uptime for this service.

  • Stop the database and shut down the VM

Event Timeline

Bstorm moved this task from Backlog to ToolsDB on the Data-Services board.
Bstorm triaged this task as Medium priority. Sep 23 2020, 6:43 PM
Bstorm updated the task description.

Now that I started that doc, I suspect that the osmdb service is probably just as well shut down during the upgrade instead of failing over. It'll be such a fuss to move it back and forth that it likely won't be much better considering it isn't very widely used anyway (and has been down for extended periods without anyone even reporting it before).

@Andrew That suggests that we could announce and upgrade cloudvirt1020 whenever you are ready to do it. We'll want that done before the network outage window when I want to fail over toolsdb.

Just to confirm that I understand, the steps are:

  1. re-image cloudvirt1020 (will cause osmdb outage but not toolsdb outage)
  2. failover toolsdb from cloudvirt1019 to cloudvirt1020
  3. re-image cloudvirt1019 (will not cause any outages for anyone)

Is that right? And is osmdb backed up enough on 1019 that if the reimage goes badly we have some way to recover lost data? If so, let's do 1020 on Tuesday.
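The three-step order above can be sanity-checked with a small Python sketch that tracks which hypervisor hosts each service's active primary and reports which services go down at each re-image. The placement in `PRIMARIES` is an assumption inferred from this thread (the toolsdb primary on cloudvirt1019, the osmdb primary on cloudvirt1020), not a confirmed inventory. The function and variable names are hypothetical, and the brief read-only window of the failover itself is not modelled.

```python
def outages_per_step(primaries, plan):
    """Report which services lose their active primary at each step.

    primaries: dict of service -> hypervisor hosting its active primary.
    plan: list of ("reimage", hypervisor) or ("failover", service, target).
    Returns a list of (step, affected_services) pairs.
    """
    placement = dict(primaries)  # copy so the caller's dict is untouched
    report = []
    for step in plan:
        if step[0] == "reimage":
            # Every primary on the re-imaged hypervisor is down meanwhile.
            hit = sorted(s for s, host in placement.items() if host == step[1])
            report.append((step, hit))
        elif step[0] == "failover":
            _, service, target = step
            placement[service] = target  # primary now lives on the target
            report.append((step, []))
        else:
            raise ValueError(f"unknown step: {step!r}")
    return report


# Assumed placement, inferred from the discussion rather than confirmed.
PRIMARIES = {"toolsdb": "cloudvirt1019", "osmdb": "cloudvirt1020"}
PLAN = [
    ("reimage", "cloudvirt1020"),
    ("failover", "toolsdb", "cloudvirt1020"),
    ("reimage", "cloudvirt1019"),
]

if __name__ == "__main__":
    for step, affected in outages_per_step(PRIMARIES, PLAN):
        print(step, "->", affected or "no outage")
```

Under these assumptions the sketch matches the plan as stated: only osmdb is affected by the first re-image, and the last re-image affects no one.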

Change #3 to cloudvirt1019, which I presume you intended, and yes. For #1, I'll stop replication and stop the DBs before we re-image.

osmdb is fully replicated and can be started up from clouddb1004 if it goes pear-shaped.

Change #3 to cloudvirt1019, which I presume you intended, and yes.

fixed!

Bstorm updated the task description.

Just remembered that before the DB goes into readonly mode, I might want to switch PAWS into sqlite mode.

Change 636468 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: Fail over toolsdb to its replica

https://gerrit.wikimedia.org/r/636468

Change 636469 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] toolsdb: remove temporarily replication filters

https://gerrit.wikimedia.org/r/636469

Bstorm changed the task status from Open to Stalled. Oct 29 2020, 12:18 AM

We fixed replication and did the reimage at the same time so there wasn't additional downtime on top of all that.

Change 636469 merged by Bstorm:
[operations/puppet@production] toolsdb: remove temporary replication filters

https://gerrit.wikimedia.org/r/636469

Change 636468 abandoned by Bstorm:
[operations/puppet@production] toolsdb: Fail over toolsdb to its replica

Reason:
Not using this now

https://gerrit.wikimedia.org/r/636468