Get rid of deployment-db0[34]
Closed, Resolved, Public

Description

Due to the failure of cloudvirt1018 last month, both of our database instances at the time suffered disk corruption. The work to get things working again happened in T216404: deployment-db03.deployment-prep.eqiad.wmflabs instance can not start and T216067: Recover from corrupted beta MySQL slave (deployment-db04).
The current situation, looking at db-labs.php, is:

  • deployment-db03 is unused (theoretically, I haven't dug into what else could be talking to this beyond MW so don't rely on it)
  • deployment-db04 is the master
  • deployment-db05 is a fresh slave which we should keep and possibly make the new master

When investigating puppet failures today I noticed that deployment-db04 had all sorts of nonsense like puppet function files overwritten with apt data and junk. I do not trust it and we should delete it. db03 is in a similar position though it's at least seemingly unused.

On top of that, both are still jessie, and we have T218729: Migrate deployment-prep away from Debian Jessie to Debian Stretch/Buster.

Event Timeline

So I suggest we:

  • Confirm db03 is unused and that we have any data we need from it, then eliminate it.
  • Create deployment-db06 as stretch (and manually check per T219088 that it is on a different host; this info is visible in horizon and openstack-dashboard), begin the process of copying data in from deployment-db05, and add it to MW config.
  • Make deployment-db05 the master (I guess we just need to set stuff read only, swap the replication config around and possibly handle the replication user+permissions, update MW, disable read-only)
  • Get rid of db04.
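
The master switch outlined in the list above could be sketched as a dry-run plan. This is not the procedure that was actually run; the function name print_switch_plan and its three arguments are hypothetical, and the statements are standard MariaDB replication commands under the assumption that GTID replication is in use and that the replication user already exists on the new master. Nothing is executed, the plan is only printed:

```shell
# Print a dry-run plan for promoting a replica to master. Nothing is executed.
# $1: old master, $2: new master, $3: remaining replica (hypothetical arguments).
# Assumes GTID replication and that the repl user+grants exist on the new master.
print_switch_plan() {
    local old="$1" new="$2" replica="$3"
    cat <<EOF
[$old] SET GLOBAL read_only = 1;
[$new] SHOW SLAVE STATUS\G  -- wait until it has caught up with $old
[$new] STOP SLAVE;
[$new] RESET SLAVE ALL;
[$replica] STOP SLAVE;
[$replica] CHANGE MASTER TO MASTER_HOST='$new', master_use_gtid = slave_pos;
[$replica] START SLAVE;
[mediawiki-config] point db-labs.php at $new as master
[$new] SET GLOBAL read_only = 0;
EOF
}

# Usage:
#   print_switch_plan deployment-db04 deployment-db05 deployment-db06
```

Printing the plan first makes it easy to paste into the task for review before touching any host.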

It doesn't look like mysql/mariadb is actually running on db03 (can't find any references in mediawiki-config or puppet either) so I'm going to shut it off and, assuming nothing comes up, delete it in a couple of weeks.

Mentioned in SAL (#wikimedia-releng) [2019-03-24T16:06:38Z] <Krenair> shut off old deployment-db03 instance per T219087

Created deployment-db06. It's been assigned the same host as -db04 and a different one to -db05, so that will do. Did the standard deployment-prep puppet cert stuff, then did the mysql setup using my steps from T216067#4952271.

Based on the above ticket and https://wikitech.wikimedia.org/wiki/Setting_up_a_MySQL_replica#Transferring_Data, running:

  • On deployment-db06, from /srv/sqldata, in screen import: nc -l -p 9210 | /opt/wmf-mariadb101/bin/mbstream -x
  • On deployment-db05, in screen export: mariabackup --innobackupex --open-files-limit=8000 --stream=xbstream /srv/sqldata --user=root --slave-info -S /tmp/systemd-private-d6c71da3465641b3aa68e8390a2cc75c-mariadb.service-HPflMl/tmp/mysql.sock 2>backup.log.2 | nc deployment-db06 9210

Lots of stuff is different here compared to the last time because it seems mariabackup on stretch does not do tar, it supports only xbstream.

Once that completed, on deployment-db06 I ran mariabackup --innobackupex --apply-log --use-memory=12G /srv/sqldata, chown -R mysql: /srv, and service mariadb start.
Ran the following using /root/mysql.sh, which is a shortcut I made for the mysql.sock thing above because that's gonna get annoying. Wonder why systemd does that.

SET GLOBAL gtid_slave_pos = '0-2886731013-220863708';
CHANGE MASTER TO MASTER_USER='repl', MASTER_PASSWORD='get this from deployment-puppetmaster03:/var/lib/git/labs/private/modules/secret/secrets/mysql/repl_password', MASTER_PORT=3306, MASTER_HOST='deployment-db04', master_use_gtid = slave_pos;
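
The contents of /root/mysql.sh are not shown in this task, so the following is only a guess at what such a shortcut might look like. The socket path has the shape that systemd's PrivateTmp= setting produces (a per-unit directory under /tmp), which is presumably why the socket ends up there. The function name find_mysql_sock is hypothetical:

```shell
# Hypothetical stand-in for /root/mysql.sh; the real script is not shown here.

# Print the first mysql.sock found under a systemd private tmp dir for mariadb.
# $1: base directory to search (defaults to /tmp).
# Returns non-zero if no socket path exists.
find_mysql_sock() {
    local base="${1:-/tmp}"
    local candidate
    for candidate in "$base"/systemd-private-*-mariadb.service-*/tmp/mysql.sock; do
        # If the glob matched nothing, candidate is the literal pattern
        # and this existence check fails.
        if [ -e "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage (as root on the database host):
#   mysql --user=root -S "$(find_mysql_sock)" "$@"
```

Globbing for the directory avoids hard-coding the random suffix, which changes whenever the service's private tmp directory is recreated.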

Tested replication by making a new edit and looking at a query like select * from enwiki.revision order by rev_id desc limit 1 \G; it works.
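
As a complement to the edit-and-query test above, the replication threads can also be checked directly. This is a minimal sketch, not something that was run for this task: the function name replication_ok is hypothetical, and the usage line assumes /root/mysql.sh passes extra arguments through to the mysql client. It reads SHOW SLAVE STATUS\G output on stdin and succeeds only if both the IO and SQL threads report Yes:

```shell
# Read `SHOW SLAVE STATUS\G` output on stdin; succeed only if both
# replication threads are running. Hypothetical helper, not from this task.
replication_ok() {
    local status
    status=$(cat)
    # The colon directly after the field name keeps this from matching
    # Slave_SQL_Running_State.
    printf '%s\n' "$status" |
        grep -Eq '^[[:space:]]*Slave_IO_Running:[[:space:]]*Yes[[:space:]]*$' &&
    printf '%s\n' "$status" |
        grep -Eq '^[[:space:]]*Slave_SQL_Running:[[:space:]]*Yes[[:space:]]*$'
}

# Usage (assumes mysql.sh forwards its arguments to mysql):
#   /root/mysql.sh -e 'SHOW SLAVE STATUS\G' | replication_ok && echo "replicating"
```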

Change 498729 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] db-labs: add new slave deployment-db06

https://gerrit.wikimedia.org/r/498729

Change 498729 merged by jenkins-bot:
[operations/mediawiki-config@master] db-labs: add new slave deployment-db06

https://gerrit.wikimedia.org/r/498729

Mentioned in SAL (#wikimedia-releng) [2019-03-28T17:23:56Z] <Krenair> deployment-prep T219087 beginning master switch

Change 499830 had a related patch set uploaded (by Alex Monk; owner: Alex Monk):
[operations/mediawiki-config@master] db-labs: Update MW to use new master

https://gerrit.wikimedia.org/r/499830

Change 499830 merged by jenkins-bot:
[operations/mediawiki-config@master] db-labs: Update MW to use new master

https://gerrit.wikimedia.org/r/499830

Mentioned in SAL (#wikimedia-releng) [2019-03-28T18:49:17Z] <Krenair> shut off deployment-db04 instance per T219087

Just got to wait for deletion time now. Will probably give it a couple of weeks.