
Move orchestrator from db2093 to db1115
Closed, Resolved · Public

Description

Once db1115 has been reimaged and cleaned up, we should get orchestrator to start using db1115 instead of db2093 so we can have consistency (eqiad being the writable DC).
@Kormat I will ping you once this can happen.

  • Move orchestrator DB from db2093 to db1115 (enable replication)
    • Fix orchestrator_srv grant on db1115, ensure it works from dborch1001.
    • Disable puppet on dborch1001.
    • Stop orchestrator
    • Merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/774485
    • Take a mysqldump of the orchestrator database on db2093: sudo mysqldump --single-transaction --ssl-verify-server-cert -h db2093.codfw.wmnet --databases orchestrator > orchestrator.dump.T301315
    • Load dump into the test section: sudo db-mysql db1124 < orchestrator.dump.T301315. Double-check that it loads correctly, and that the data looks good.
    • Drop orchestrator database on db2093: sudo db-mysql db2093 -e 'drop database orchestrator'
    • Load mysqldump into db1115, with replication enabled: sudo db-mysql db1115 < orchestrator.dump.T301315
    • Run puppet on dborch1001, start orchestrator.
  • Reimage db2093 to Bullseye
  • Start backing up orchestrator on db1115
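The dump/load steps in the checklist boil down to four commands. The sketch below only assembles and prints them so the ordering is easy to review; each line is meant to be run by hand from a cumin host (hostnames and the T301315 dump filename are taken verbatim from the checklist):

```shell
# The four core commands from the checklist, in order: dump from db2093,
# trial-load into the test section (db1124), drop on db2093, then load into
# db1115 with replication enabled. Printed here rather than executed.
steps='sudo mysqldump --single-transaction --ssl-verify-server-cert -h db2093.codfw.wmnet --databases orchestrator > orchestrator.dump.T301315
sudo db-mysql db1124 < orchestrator.dump.T301315
sudo db-mysql db2093 -e '\''drop database orchestrator'\''
sudo db-mysql db1115 < orchestrator.dump.T301315'

printf '%s\n' "$steps"
```

The trial load into db1124 comes before the drop on db2093, so a bad dump is caught while the original data still exists.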

Event Timeline

Marostegui changed the task status from Open to Stalled. · Feb 9 2022, 6:46 AM
Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Blocked on the DBA board.
Marostegui changed the task status from Stalled to Open. · Feb 10 2022, 7:05 AM

@Kormat db1115 is now running Bullseye + 10.4, so this can now proceed. Once done, please reimage db2093 to Bullseye.

Change 774485 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] orchestrator: Switch to db1115 as backend.

https://gerrit.wikimedia.org/r/774485

Regarding the plan I would include the following steps before dropping the database in db2093:

  • Check current grants on db2093 to make sure they are the same on db1115
  • Load the mysqldump into the test hosts (db1124 is the master) to make sure the dump was taken correctly and loads cleanly. Selecting data from a few tables wouldn't hurt either.
  • Double-check that zarcillo indeed lives on db2093 before doing the stop+reimage.

Once the above is all checked/done, it should be safe to proceed with the drop database orchestrator step on db2093.

Regarding the plan I would include the following steps before dropping the database in db2093:

  • Check current grants on db2093 to make sure they are the same on db1115

Grants audited. There are some differences, but either they don't matter (e.g. old tendril stuff that can get cleaned up later), or they're dumps-related (which will need adjusting anyway).
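For reference, this kind of audit can be scripted as a diff of sorted SHOW GRANTS output. A sketch, run from a cumin host; the orchestrator_srv account and the '%' host pattern are assumptions about the exact grants:

```shell
# Dump the grants from both hosts, sort, then diff. Sorting first keeps mere
# ordering differences from showing up as changes. The db-mysql invocations
# are shown as comments since they need production access:
#   sudo db-mysql db2093 -e "SHOW GRANTS FOR 'orchestrator_srv'@'%'" | sort > grants.db2093
#   sudo db-mysql db1115 -e "SHOW GRANTS FOR 'orchestrator_srv'@'%'" | sort > grants.db1115
#   diff grants.db2093 grants.db1115
# Demonstration of the sort normalization on sample grant lines:
printf 'GRANT USAGE ON *.*\nGRANT SELECT ON orchestrator.*\n' | sort
```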

  • Load the mysqldump into the test hosts (db1124 is the master) to make sure the dump was taken correctly and loads cleanly. Selecting data from a few tables wouldn't hurt either.

Added to procedure.

  • Double-check that zarcillo indeed lives on db2093 before doing the stop+reimage.

I can confirm that db2093 sees the updated s3 primary from the switchover yesterday, so it's definitely present, and getting updates.
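For the record, the check amounts to asking db2093's local zarcillo copy which host it thinks the s3 primary is. The zarcillo.masters table and its columns below are an assumption about the schema; the query is only printed here, not executed:

```shell
# Confirm db2093's zarcillo copy reflects the post-switchover s3 primary.
# On a cumin host this would be run as:
#   sudo db-mysql db2093 -e "$query"
# If the instance column shows the new s3 primary, the copy is getting updates.
query="SELECT section, instance FROM zarcillo.masters WHERE section = 's3'"
echo "$query"
```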

Once the above is all checked/done, it should be safe to proceed with the drop database orchestrator step on db2093.

Cool, thanks!

Mentioned in SAL (#wikimedia-operations) [2022-03-30T13:55:45Z] <kormat> stopping orchestrator for backend move T301315

Change 774485 merged by Kormat:

[operations/puppet@production] orchestrator: Switch to db1115 as backend.

https://gerrit.wikimedia.org/r/774485

Cookbook cookbooks.sre.hosts.reimage was started by kormat@cumin1001 for host db2093.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kormat@cumin1001 for host db2093.codfw.wmnet with OS bullseye completed:

  • db2093 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202203301410_kormat_1380369_db2093.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 775330 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Use ROW binlog_format for db_inventory.

https://gerrit.wikimedia.org/r/775330

@jcrespo: Can I hand this task over to you? I think the dumps grants probably need some cleanup, if you can have a look at that, too.

Change 775852 had a related patch set uploaded (by Kormat; author: Kormat):

[operations/puppet@production] mariadb: Stop special-casing db2093

https://gerrit.wikimedia.org/r/775852

@Kormat I can help setting up backups, but someone has to make clear first what to backup (what databases?), from which servers/dcs, and how frequently they are needed.

@Kormat I can help setting up backups, but someone has to make clear first what to backup (what databases?), from which servers/dcs, and how frequently they are needed.

Both db1115 and db2093 are already being backed up under the 'zarcillo' section name. The databases that need backing up are orchestrator and zarcillo. Whatever the existing frequency is should be fine.

Thanks, that info ensures there are no misunderstandings, as I didn't follow all the details of the task. I will create a patch and ask for your review, plus a later check of the generated backup to make sure it is as expected.

One last thing: does the orchestrator + zarcillo server have a name? Backup jobs have a completely arbitrary identifier for monitoring purposes; any suggestion (even if it is just orchestrator or zarcillo, covering both)? Alternatively, it is possible to set up separate backup jobs, even for the same server.

One last thing: does the orchestrator + zarcillo server have a name? Backup jobs have a completely arbitrary identifier for monitoring purposes; any suggestion (even if it is just orchestrator or zarcillo, covering both)? Alternatively, it is possible to set up separate backup jobs, even for the same server.

db_inventory is the best name; I need to get around to doing some cleanup on the puppet code to use that more consistently, so if you could already make that change on the backups side, that would be appreciated :)

db_inventory is the best name; I need to get around to doing some cleanup on the puppet code to use that more consistently, so if you could already make that change on the backups side, that would be appreciated :)

👍

Because of T303178, this may take longer than the trivial change I expected. I will set up the backup ASAP, but for backup checks I will have to think about the best way to handle a list of valid sections to back up.

Change 776169 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Start backing up orchestrator & rename section db_inventory

https://gerrit.wikimedia.org/r/776169

Change 776170 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Monitor db_inventory rather than zarcillo section

https://gerrit.wikimedia.org/r/776170

Change 776171 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/software/wmfbackups@master] check: Make zarcillo an invalid section, make db_inventory a valid one

https://gerrit.wikimedia.org/r/776171

Change 776169 merged by Jcrespo:

[operations/puppet@production] dbbackups: Start backing up orchestrator & rename section db_inventory

https://gerrit.wikimedia.org/r/776169

Change 776969 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Setup a valid_sections.txt config for db backup checks

https://gerrit.wikimedia.org/r/776969

Change 776969 merged by Jcrespo:

[operations/puppet@production] dbbackups: Setup a valid_sections.txt config for db backup checks

https://gerrit.wikimedia.org/r/776969
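The shape of a valid_sections.txt-driven check can be sketched as follows. The file format (one section name per line, with '#' comments) is an assumption, and the file is written locally here just for the demonstration; the real config lives in puppet:

```shell
# Minimal sketch of a valid-sections check: the checker only accepts backup
# job names listed in valid_sections.txt, so renaming zarcillo to
# db_inventory means updating this file rather than the checker's code.
cat > valid_sections.txt <<'EOF'
# backup jobs known to the checker
db_inventory
s1
EOF

is_valid_section() {
    grep -v '^[[:space:]]*#' valid_sections.txt | grep -qx "$1"
}

is_valid_section db_inventory && echo "db_inventory: valid"
is_valid_section zarcillo || echo "zarcillo: rejected"
```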

Change 777312 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Update valid_sections.txt permissions to be world-readable

https://gerrit.wikimedia.org/r/777312

Change 777312 merged by Jcrespo:

[operations/puppet@production] dbbackups: Update valid_sections.txt permissions to be world-readable

https://gerrit.wikimedia.org/r/777312

Change 776170 merged by Jcrespo:

[operations/puppet@production] dbbackups: Monitor db_inventory rather than zarcillo section

https://gerrit.wikimedia.org/r/776170

Backups are set up, and monitoring too:

Screenshot_20220405_114526.png (43 KB)

The only thing left is for the DBAs to check that the backup available at (e.g.) dbprov1002:/srv/backups/dumps/latest/dump.db_inventory.2022-04-05--03-13-31 looks as expected (content-wise).

The only thing left is for the DBAs to check that the backup available at (e.g.) dbprov1002:/srv/backups/dumps/latest/dump.db_inventory.2022-04-05--03-13-31 looks as expected (content-wise).

I had a look at a few of the tables in the dump, and they all look fine. Thanks!

Change 776171 merged by Jcrespo:

[operations/software/wmfbackups@master] check: Read list of valid sections/valid backup jobs from a file

https://gerrit.wikimedia.org/r/776171

Change 775852 merged by Kormat:

[operations/puppet@production] mariadb: Stop special-casing db2093

https://gerrit.wikimedia.org/r/775852

Change 775330 merged by Kormat:

[operations/puppet@production] mariadb: Use ROW binlog_format for db_inventory.

https://gerrit.wikimedia.org/r/775330
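For completeness, the change the last patch describes amounts to a my.cnf override along these lines (a sketch; the actual puppet template and file path are not shown in this task):

```ini
# db_inventory hosts replicate between db1115 and db2093, so use row-based
# binary logging for deterministic replication of orchestrator's writes.
[mysqld]
binlog_format = ROW
```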