Automate DB upgrades
Open, Medium, Public

Description

Now that dbctl is in place, we should build automation for upgrading database hosts (MySQL, kernel, ...).
Ideally the script should:

  • Check whether the host can be depooled (min_number of hosts in a section)
  • Depool the host
  • Stop MySQL
  • Upgrade the host
  • Reboot it, if desired
  • Start MySQL
  • Run mysql_upgrade
  • Start replication
  • Once replication is up to date, use the dbctl -p option to slowly repool the host back into production
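
To make the ordering of the steps above concrete, here is a minimal Python sketch. It is not the actual cookbook or auto_schema code: the `Section` class, `can_depool`, `upgrade_steps`, `run_upgrade`, the step names, and the host names in the demo are all hypothetical, and no real dbctl or MySQL command is invoked. Each step is a caller-supplied callable, so the sketch only illustrates the safety check and the sequencing.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Section:
    """Hypothetical view of one database section."""
    name: str
    pooled_hosts: List[str]
    min_pooled: int  # assumed minimum number of hosts that must stay pooled


def can_depool(section: Section, host: str) -> bool:
    """Return True if depooling `host` still leaves `min_pooled` hosts pooled."""
    if host not in section.pooled_hosts:
        return False
    return len(section.pooled_hosts) - 1 >= section.min_pooled


def upgrade_steps(reboot: bool) -> List[str]:
    """Ordered (hypothetical) step names mirroring the list in the description."""
    steps = ["depool", "stop_mysql", "upgrade_host"]
    if reboot:
        steps.append("reboot")
    steps += [
        "start_mysql",
        "mysql_upgrade",
        "start_replication",
        "wait_for_replication",
        "repool_gradually",
    ]
    return steps


def run_upgrade(
    section: Section,
    host: str,
    actions: Dict[str, Callable[[str], None]],
    reboot: bool = False,
) -> List[str]:
    """Run every step in order; refuse to start if the host cannot be depooled."""
    if not can_depool(section, host):
        raise RuntimeError(f"{host} cannot be depooled from {section.name}")
    done = []
    for step in upgrade_steps(reboot):
        actions[step](host)  # an exception here aborts the remaining steps
        done.append(step)
    return done


if __name__ == "__main__":
    # Demo with placeholder actions that only print what they would do.
    demo_section = Section("s5", ["db-a", "db-b", "db-c"], min_pooled=2)
    demo_actions = {
        name: (lambda h, name=name: print(f"{name}: {h}"))
        for name in upgrade_steps(reboot=True)
    }
    run_upgrade(demo_section, "db-a", demo_actions, reboot=True)
```

Because a failing step raises and stops the loop, the host stays depooled until someone investigates, which is the conservative behaviour for a half-upgraded database.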

Event Timeline

Marostegui triaged this task as Medium priority. Dec 4 2019, 1:49 PM
Marostegui moved this task from Triage to Backlog on the DBA board.

Random note: auto_schema provides a simple interface for this kind of work. It still needs proper support for multi-instance hosts, but otherwise it is ready to at least be tested. I wrote a POC for MySQL upgrades in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/+/master/dbtools/auto_schema/upgrade_mysql.py

Change 748720 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/software@master] auto_schema: Rework upgrade_mysql a bit to reuse code

https://gerrit.wikimedia.org/r/748720

Change 749176 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/cookbooks@master] Add MySQL upgrade cookbook

https://gerrit.wikimedia.org/r/749176

Change 749195 had a related patch set uploaded (by Jbond; author: jbond):

[operations/cookbooks@master] Add MySQL upgrade cookbook

https://gerrit.wikimedia.org/r/749195

Change 749176 merged by jenkins-bot:

[operations/cookbooks@master] Add MySQL upgrade cookbook

https://gerrit.wikimedia.org/r/749176

Change 751157 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/cookbooks@master] sre.myql.upgrade: Fix missing argument

https://gerrit.wikimedia.org/r/751157

Change 751157 merged by jenkins-bot:

[operations/cookbooks@master] sre.myql.upgrade: Fix missing argument

https://gerrit.wikimedia.org/r/751157

Change 751161 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/cookbooks@master] sre.mysql.upgrade: Fix argparse

https://gerrit.wikimedia.org/r/751161

Change 751161 merged by jenkins-bot:

[operations/cookbooks@master] sre.mysql.upgrade: Fix argparse

https://gerrit.wikimedia.org/r/751161

Change 751225 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/cookbooks@master] sre.mysql.upgrade: Fix calling icinga with list

https://gerrit.wikimedia.org/r/751225

Change 751225 merged by jenkins-bot:

[operations/cookbooks@master] sre.mysql.upgrade: Fix calling icinga with list

https://gerrit.wikimedia.org/r/751225

Change 751226 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/cookbooks@master] sre.mysql.upgrade: Fix the icinga, second try

https://gerrit.wikimedia.org/r/751226

Change 751226 merged by jenkins-bot:

[operations/cookbooks@master] sre.mysql.upgrade: Fix the icinga, second try

https://gerrit.wikimedia.org/r/751226

Change 751228 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/cookbooks@master] sre.mysql.upgrade: Add logger object

https://gerrit.wikimedia.org/r/751228

Change 751228 merged by jenkins-bot:

[operations/cookbooks@master] sre.mysql.upgrade: Add logger object

https://gerrit.wikimedia.org/r/751228

Ladsgroup moved this task from Backlog to In progress on the DBA board.

I just did an upgrade using a cookbook: P18336

The cookbook doesn't depool yet, so I let auto_schema handle that part.

Change 748720 merged by jenkins-bot:

[operations/software@master] auto_schema: Rework upgrade_mysql to reuse code and cookbooks

https://gerrit.wikimedia.org/r/748720

Change 752700 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/software@master] auto_schema: Force depool in codfw for mysql upgrades

https://gerrit.wikimedia.org/r/752700

Change 752700 merged by jenkins-bot:

[operations/software@master] auto_schema: Force depool in codfw for mysql upgrades

https://gerrit.wikimedia.org/r/752700

Change 754872 had a related patch set uploaded (by Volans; author: Volans):

[operations/cookbooks@master] sre.mysql.upgrade: various improvements

https://gerrit.wikimedia.org/r/754872

Mentioned in SAL (#wikimedia-operations) [2022-01-19T13:35:15Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1100 (T239814)', diff saved to https://phabricator.wikimedia.org/P18864 and previous config saved to /var/cache/conftool/dbconfig/20220119-133514-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-19T17:16:41Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1100 (T239814)', diff saved to https://phabricator.wikimedia.org/P18885 and previous config saved to /var/cache/conftool/dbconfig/20220119-171640-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-19T18:01:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1100 (T239814)', diff saved to https://phabricator.wikimedia.org/P18888 and previous config saved to /var/cache/conftool/dbconfig/20220119-180154-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-19T18:08:40Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1110 (T239814)', diff saved to https://phabricator.wikimedia.org/P18889 and previous config saved to /var/cache/conftool/dbconfig/20220119-180840-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-19T18:16:24Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1110 (T239814)', diff saved to https://phabricator.wikimedia.org/P18890 and previous config saved to /var/cache/conftool/dbconfig/20220119-181623-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-01-19T19:01:38Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1110 (T239814)', diff saved to https://phabricator.wikimedia.org/P18893 and previous config saved to /var/cache/conftool/dbconfig/20220119-190137-ladsgroup.json

I ran the automatic upgrade on two eqiad hosts in s5. On the first, the database simply refused to come back up; that seemed to be caused by the host being old (or similar), and a power cycle seemed to fix it. The second one went smoothly, with no issues so far.

Ladsgroup moved this task from In progress to Ready on the DBA board.

We are currently upgrading all of our hosts to bullseye, which requires a different cookbook, so we can't focus on this work for a while. I'm moving the task back to reflect reality; we will pick this up and finish it (including support for multi-instance hosts and hosts that have replicas), maybe in Q4.