
Upgrade s8 to Buster + MariaDB 10.4
Closed, ResolvedPublic

Description

After the DC switchover (scheduled for 14 September 2021):

  • Available backup source on the standby DC (db2098 - all sections have sources everywhere now)
  • Switch over backup generation on the standby DC (ready to deploy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/731404 )
  • Candidate master on the standby DC
  • Master on the standby DC
  • Candidate master on the primary DC
  • Available backup source on the primary DC (db1171 - all sections have sources everywhere now)
  • Switch over backup generation on the primary DC (@jcrespo): https://gerrit.wikimedia.org/r/c/operations/puppet/+/736946
  • Switchover on the primary DC to promote a Buster+10.4 host to master: T294321
  • Upgrade the old master, make it a candidate master, and pool it
  • Cleanup (remove) old backup sources from both DCs

Please read the procedure doc for more details; a rough sketch of the recurring per-host mechanics follows below.
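
For anyone following the timeline below, the recurring per-host mechanics behind these checklist items look roughly like the following. This is only a sketch assuming the usual dbctl and cookbook tooling; the host name, flags and ordering are illustrative, and the procedure doc remains the authoritative reference.

host=db2081   # example replica, substitute the host being worked on

# Take the replica out of MediaWiki traffic (run from a cumin host).
sudo dbctl instance "$host" depool
sudo dbctl config commit -m "Depool $host T290868"

# Reinstall with Buster; puppet provisions MariaDB 10.4 on the new OS.
sudo cookbook sre.hosts.reimage --os buster "$host"

# Reclone or restore the data, then upgrade the system tables and verify the
# data before letting replication run (see the wbt_item_terms notes below).
sudo mysql_upgrade
sudo mysqlcheck --all-databases

# Repool once replication has caught up.
sudo dbctl instance "$host" pool
sudo dbctl config commit -m "Repool $host T290868"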

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 721288 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001

https://gerrit.wikimedia.org/r/721288

Mentioned in SAL (#wikimedia-operations) [2021-09-23T11:19:53Z] <marostegui> Upgrade db2081 db2082 db2083 db2084 db2091 db2152 T290868

All s8 core codfw replica hosts upgraded to the 10.4.21 minor release.
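
For context, this minor bump to 10.4.21 is an in-place package upgrade rather than a reimage. A hedged sketch of the per-host steps (done with the host depooled; the wmf-mariadb104 package name is an assumption on my part, verify what the host actually runs):

sudo mysql -e "STOP SLAVE;"            # stop replication cleanly first
sudo systemctl stop mariadb
sudo apt-get update
sudo apt-get install wmf-mariadb104    # pulls in the 10.4.21 build
sudo systemctl start mariadb
sudo mysql_upgrade                     # normally a no-op for a point release
sudo mysql -e "START SLAVE;"
sudo mysql -e "SHOW SLAVE STATUS\G"    # confirm both threads are running again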

Change 724237 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db2080

https://gerrit.wikimedia.org/r/724237

Change 724237 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db2080

https://gerrit.wikimedia.org/r/724237

Change 724321 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2080: Disable notifications

https://gerrit.wikimedia.org/r/724321

Change 724321 merged by Marostegui:

[operations/puppet@production] db2080: Disable notifications

https://gerrit.wikimedia.org/r/724321

Cookbook cookbooks.sre.experimental.reimage was started by marostegui@cumin1001 for host db2080.codfw.wmnet

Cookbook cookbooks.sre.experimental.reimage completed:

  • db2080 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB
    • Removed from Debmonitor
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/wmf-auto-reimage/202109280830_marostegui_16212_db2080.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal

db2080 (candidate master) reimaged - now checking its tables.

Mentioned in SAL (#wikimedia-operations) [2021-09-28T13:40:31Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2080 T290868', diff saved to https://phabricator.wikimedia.org/P17339 and previous config saved to /var/cache/conftool/dbconfig/20210928-134030-marostegui.json
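
The dbctl SAL entries in this task correspond roughly to the following sequence on a cumin host (a sketch; dbctl itself generates the paste and config snapshot linked above):

sudo dbctl instance db2080 depool                     # mark the instance as depooled
sudo dbctl config diff                                # sanity-check the pending change
sudo dbctl config commit -m "Depool db2080 T290868"   # apply it; this is what the SAL line records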

db2080 crashed while checking its tables - so it needs to be recloned.
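
Recloning a replica generally means streaming a physical copy of the datadir from a healthy, depooled donor (db2081 gets depooled right below for this). WMF has dedicated transfer tooling for it; the mariabackup-over-netcat flow here is only a generic sketch of the idea, with the donor host, port and credentials handling being illustrative:

# On db2080 (recipient): clear the old datadir and wait for the stream.
sudo systemctl stop mariadb
sudo rm -rf /srv/sqldata && sudo mkdir /srv/sqldata
nc -l -p 4444 | mbstream -x -C /srv/sqldata

# On db2081 (donor, depooled): stream a consistent physical copy.
# Credentials omitted; --target-dir is only used for temporary files when streaming.
mariabackup --backup --stream=xbstream --target-dir=/tmp | nc db2080.codfw.wmnet 4444

# Back on db2080: apply the redo log, fix ownership and start MariaDB.
mariabackup --prepare --target-dir=/srv/sqldata
sudo chown -R mysql:mysql /srv/sqldata
sudo systemctl start mariadb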

Mentioned in SAL (#wikimedia-operations) [2021-09-29T04:50:34Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2081 T290868', diff saved to https://phabricator.wikimedia.org/P17342 and previous config saved to /var/cache/conftool/dbconfig/20210929-045033-marostegui.json

Change 724584 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2081: Disable notifications

https://gerrit.wikimedia.org/r/724584

Change 724584 merged by Marostegui:

[operations/puppet@production] db2081: Disable notifications

https://gerrit.wikimedia.org/r/724584

db2080 has been recloned - checking its tables again.
Replication has been configured but not started.
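
Configuring replication without starting it keeps the dataset quiescent while the tables are checked. A sketch of what that amounts to (db2079 is the current s8 codfw master; the account name, password and binlog coordinates are placeholders that come from the clone metadata in practice):

sudo mysql -e "
  CHANGE MASTER TO
    MASTER_HOST='db2079.codfw.wmnet',
    MASTER_USER='repl',
    MASTER_PASSWORD='********',
    MASTER_LOG_FILE='db2079-bin.001234',
    MASTER_LOG_POS=123456789,
    MASTER_SSL=1;
"
# Deliberately no START SLAVE here; that only happens once the table check is clean.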

Mentioned in SAL (#wikimedia-operations) [2021-09-29T07:25:21Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2081 T290868', diff saved to https://phabricator.wikimedia.org/P17345 and previous config saved to /var/cache/conftool/dbconfig/20210929-072520-marostegui.json

db2080 has crashed again, I am going to restore this from a logical backup. Given that it is the candidate master, I don't want to take any risks.

Actually, I am going to try to restore from snapshots and see if that works (and we can double-check whether the backup source host is OK too).

Using: dbprov2001.codfw.wmnet:/srv/backups/snapshots/latest/snapshot.s8.2021-09-29--19-00-03.tar.gz (This is a 10.1 snapshot)
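
For reference, restoring from one of these dbprov snapshots is essentially unpacking the tarball into the datadir. A hedged sketch (WMF has recovery automation for this; the --strip-components assumption about the tarball layout is mine, and since the snapshot comes from a 10.1 host the upgrade step has to run on the 10.4 side afterwards):

sudo systemctl stop mariadb
sudo rm -rf /srv/sqldata && sudo mkdir /srv/sqldata

# Stream and unpack the snapshot from the provisioning host.
ssh dbprov2001.codfw.wmnet \
  'cat /srv/backups/snapshots/latest/snapshot.s8.2021-09-29--19-00-03.tar.gz' \
  | sudo tar -xz -C /srv/sqldata --strip-components=1

sudo chown -R mysql:mysql /srv/sqldata
sudo systemctl start mariadb
sudo mysql_upgrade      # the snapshot was taken on 10.1, the host now runs 10.4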

db2080 has been recloned from the backup. I have configured replication but NOT started it.
Now I am checking its tables again to see if something breaks.

So far both crashes are related to wbt_item_terms, which is almost 200 GB. I am checking it right now to see whether it crashes the host again.

It worked fine:

mysql:root@localhost [wikidatawiki]> check table wbt_item_terms;
+-----------------------------+-------+----------+----------+
| Table                       | Op    | Msg_type | Msg_text |
+-----------------------------+-------+----------+----------+
| wikidatawiki.wbt_item_terms | check | status   | OK       |
+-----------------------------+-------+----------+----------+
1 row in set (3 hours 39 min 14.537 sec)

I am letting the automatic check finish though (it will check that table again).

db2080's check finally came back clean. I have started replication and so far so good.

Mentioned in SAL (#wikimedia-operations) [2021-10-01T08:43:46Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2080 T290868', diff saved to https://phabricator.wikimedia.org/P17390 and previous config saved to /var/cache/conftool/dbconfig/20211001-084345-marostegui.json

Change 731401 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2079: Disable notifications

https://gerrit.wikimedia.org/r/731401

Change 731401 merged by Marostegui:

[operations/puppet@production] db2079: Disable notifications

https://gerrit.wikimedia.org/r/731401

Mentioned in SAL (#wikimedia-operations) [2021-10-18T11:49:11Z] <marostegui> Reimage db2079 (codfw s8 master) T290868

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2079.codfw.wmnet with OS buster
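
The reimage itself is driven by the spicerack cookbook; the invocation from a cumin host is roughly the following (flags are approximate and may have changed since):

sudo cookbook sre.hosts.reimage --os buster -t T290868 db2079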

Change 731404 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbbackups: Migrate s8 backups db2100 -> db2098

https://gerrit.wikimedia.org/r/731404

Change 721288 abandoned by Marostegui:

[operations/puppet@production] dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001

Reason:

Using this https://gerrit.wikimedia.org/r/c/operations/puppet/+/731404/ instead

https://gerrit.wikimedia.org/r/721288

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2079.codfw.wmnet with OS buster completed:

  • db2079 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110181151_marostegui_20209_db2079.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db2079 (s8 codfw master) has been reimaged - checking its tables now.

Change 731404 merged by Marostegui:

[operations/puppet@production] dbbackups: Migrate s8 backups db2100 -> db2098

https://gerrit.wikimedia.org/r/731404

Upgraded all s8 eqiad replicas (apart from the backup source and the candidate master) to 10.4.21.
Two clouddb* hosts are still pending; I will do them today and tomorrow (to avoid having two of them fully cold on the same day).

Change 732115 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1126: Disable notifications

https://gerrit.wikimedia.org/r/732115

Change 732115 merged by Marostegui:

[operations/puppet@production] db1126: Disable notifications

https://gerrit.wikimedia.org/r/732115

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1126.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1126.eqiad.wmnet with OS buster completed:

  • db1126 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110200614_marostegui_11291_db1126.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1126's table check showed corruption in a wbt_item_terms index, so I am fixing that one.
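
Fixing a corrupted secondary index on an InnoDB table usually means rebuilding the table (or dropping and re-adding just the affected index). A sketch of the heavier but simpler option:

# A null ALTER rewrites the whole ~200 GB table and regenerates all indexes; takes hours.
sudo mysql wikidatawiki -e "ALTER TABLE wbt_item_terms ENGINE=InnoDB;"
sudo mysql wikidatawiki -e "CHECK TABLE wbt_item_terms;"    # verify afterwards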

Change 732472 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1018: Depool clouddb1020

https://gerrit.wikimedia.org/r/732472

Change 732472 merged by Marostegui:

[operations/puppet@production] dbproxy1018: Depool clouddb1020

https://gerrit.wikimedia.org/r/732472

Table rebuilt; checking it again.

mysql:root@localhost [wikidatawiki]> check table wbt_item_terms;
+-----------------------------+-------+----------+----------+
| Table                       | Op    | Msg_type | Msg_text |
+-----------------------------+-------+----------+----------+
| wikidatawiki.wbt_item_terms | check | status   | OK       |
+-----------------------------+-------+----------+----------+
1 row in set (2 hours 6 min 2.990 sec)

So I am going to start replication again.
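
Resuming and verifying replication after the clean check is the usual short sequence (sketch):

sudo mysql -e "START SLAVE;"
sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
# Expect both threads "Yes" and the lag trending towards 0 before repooling.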

Change 732717 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] backups: Send db2098:3318 to dbprov2002

https://gerrit.wikimedia.org/r/732717

Change 732717 merged by Marostegui:

[operations/puppet@production] backups: Send db2098:3318 to dbprov2002

https://gerrit.wikimedia.org/r/732717

Change 734060 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1109: Disable notifications

https://gerrit.wikimedia.org/r/734060

Mentioned in SAL (#wikimedia-operations) [2021-10-25T04:30:29Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1109 (s8) for reimage T290868', diff saved to https://phabricator.wikimedia.org/P17590 and previous config saved to /var/cache/conftool/dbconfig/20211025-043028-marostegui.json

Change 734060 merged by Marostegui:

[operations/puppet@production] db1109: Disable notifications

https://gerrit.wikimedia.org/r/734060

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1109.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1109.eqiad.wmnet with OS buster completed:

  • db1109 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110250501_marostegui_307_db1109.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1109 (candidate master) has been reimaged - now getting its tables checked.

db1109's check was fine. Restarted replication.

Change 736945 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1019: Depool clouddb1016

https://gerrit.wikimedia.org/r/736945

Change 736945 merged by Marostegui:

[operations/puppet@production] dbproxy1019: Depool clouddb1016

https://gerrit.wikimedia.org/r/736945

Upgraded clouddb1016 - the last pending host in the section.

Change 736946 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbbackups: Switch s8 backup generation from db1116 to db1171

https://gerrit.wikimedia.org/r/736946

@jcrespo I have put this patch up to switch s8 backups from 10.1 to 10.4, please review it and amend it if needed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/736946/

Change 738057 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1104: Disable notifications

https://gerrit.wikimedia.org/r/738057

Change 738057 merged by Marostegui:

[operations/puppet@production] db1104: Disable notifications

https://gerrit.wikimedia.org/r/738057

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1104.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1104.eqiad.wmnet with OS buster completed:

  • db1104 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111110610_marostegui_10265_db1104.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Old master (db1104) reimaged to Buster - checking its tables now.

Change 736946 merged by Jcrespo:

[operations/puppet@production] dbbackups: Switch s8 backup generation from db1116 to db1171

https://gerrit.wikimedia.org/r/736946

Change 738195 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reimage db1116, db1139, db2097, db2100 to buster

https://gerrit.wikimedia.org/r/738195

Change 738195 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reimage db1116, db1139, db2097, db2100 to buster

https://gerrit.wikimedia.org/r/738195

db1104's check is fine. I am going to start repooling.
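
Repooling after a reimage is normally done gradually rather than in one jump. A sketch assuming dbctl's percentage option (step sizes and waits are illustrative):

for pct in 10 25 50 75 100; do
  sudo dbctl instance db1104 pool -p "$pct"
  sudo dbctl config commit -m "Repool db1104 at ${pct}% T290868"
  sleep 900    # give the buffer pool time to warm up between steps
done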

@jcrespo you can proceed with the backup sources cleanup.

@Marostegui backup upgrade/cleanup is done; I will later reorganize the backup distribution (but that is out of scope).

Nice! Closing this then!
Thank you