
Upgrade s1 to Buster + MariaDB 10.4
Closed, Resolved, Public

Description

After the DC switch (scheduled for 14th Sept 2021)

  • Available backup source on the standby DC (db2141 - all sections have sources everywhere now)
  • Switchover backup generation standby DC (Ready to deploy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/721285 )
  • Candidate master on the standby DC (db2112) blocked on T293740
  • Master on the standby DC (db2103)
  • Candidate master on the primary DC
  • Available backup source on the primary DC (db1140 - all sections have sources everywhere now)
  • Switchover backup generation Primary DC (Ready to deploy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/721286 )
  • Switchover on the primary DC to promote a Buster+10.4 host to master: T293964
  • Upgrade the old master and make it a candidate master, pool it
  • Cleanup (remove) old backup sources from both DCs @jcrespo

Please read the procedure documentation for more details.

Event Timeline


Change 721285 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switch s1 backup generation from db2097 to db2141

https://gerrit.wikimedia.org/r/721285

Change 721286 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Switch s1 backup generation from db1139 to db1140

https://gerrit.wikimedia.org/r/721286

Change 721288 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Migrate s8 backups db2100 -> db2098; reimage dbprov2001

https://gerrit.wikimedia.org/r/721288

Change 721652 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Reimage db2103 to Buster

https://gerrit.wikimedia.org/r/721652

Change 721652 merged by Marostegui:

[operations/puppet@production] install_server: Reimage db2103 to Buster

https://gerrit.wikimedia.org/r/721652

Change 723038 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2103: Disable notifications

https://gerrit.wikimedia.org/r/723038

Change 723038 merged by Marostegui:

[operations/puppet@production] db2103: Disable notifications

https://gerrit.wikimedia.org/r/723038

All s1 core codfw replica hosts have been upgraded to the minor release 10.4.21.
@jcrespo you might want to also upgrade the backup source (I can do so if you give me the green light)

Done.

Change 724414 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] install_server: Remove db2103

https://gerrit.wikimedia.org/r/724414

Change 724414 merged by Marostegui:

[operations/puppet@production] install_server: Remove db2103

https://gerrit.wikimedia.org/r/724414

Cookbook cookbooks.sre.experimental.reimage was started by marostegui@cumin1001 for host db2103.codfw.wmnet

Cookbook cookbooks.sre.experimental.reimage completed:

  • db2103 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB
    • Removed from Debmonitor
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/wmf-auto-reimage/202109281303_marostegui_5029_db2103.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
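The reimage reports in this task point at per-run log files whose names appear to encode a start time, the user, a number, and the host. The following is a minimal sketch for splitting such a path into fields. It assumes the layout `<YYYYMMDDHHMM>_<user>_<number>_<host>.out` inferred from the paths quoted here; the meaning of the numeric field is not stated in the task, so it is kept under the hypothetical name `run_id`, and the timestamp is left without a timezone.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Layout inferred from the log paths quoted in this task (an assumption):
#   <YYYYMMDDHHMM>_<user>_<number>_<host>.out
LOG_NAME = re.compile(
    r"^(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<run_id>\d+)_(?P<host>[^_.]+)\.out$"
)


def parse_reimage_log(path):
    """Split a reimage log path into its start time, user, number and host."""
    name = PurePosixPath(path).name
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised reimage log name: {name}")
    return {
        # Naive datetime: the task does not say which timezone the stamp uses.
        "started": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "user": match["user"],
        # "run_id" is a hypothetical label for the numeric field.
        "run_id": int(match["run_id"]),
        "host": match["host"],
    }


info = parse_reimage_log(
    "/var/log/wmf-auto-reimage/202109281303_marostegui_5029_db2103.out"
)
print(info["host"], info["started"].isoformat())
```

The same function should also accept the later `/var/log/spicerack/sre/hosts/reimage/...` paths, since only the file name is inspected.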

db2103 (candidate master) reimaged to Buster - checking its tables now.

Mentioned in SAL (#wikimedia-operations) [2021-09-28T13:40:12Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2103 T290865', diff saved to https://phabricator.wikimedia.org/P17337 and previous config saved to /var/cache/conftool/dbconfig/20210928-134012-marostegui.json

db2103's check came back clean, so I have restarted replication.

Mentioned in SAL (#wikimedia-operations) [2021-09-29T05:56:47Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2103 T290865', diff saved to https://phabricator.wikimedia.org/P17344 and previous config saved to /var/cache/conftool/dbconfig/20210929-055645-marostegui.json
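The depool and repool of db2103 are recorded as SAL entries with a fixed shape: timestamp, user and host, commit message, diff paste, and the saved previous config. As a rough aid for auditing such entries, here is a minimal sketch that extracts those fields with a regular expression. The pattern is derived only from the entries quoted in this task and is an assumption about their general format; `parse_sal_dbctl` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Pattern derived from the dbctl SAL entries quoted in this task (an assumption
# about the general format, not a documented schema).
SAL_DBCTL = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+<(?P<user>[^@>]+)@(?P<host>[^>]+)>\s+"
    r"dbctl commit \(dc=(?P<dc>[^)]+)\): '(?P<message>[^']*)', "
    r"diff saved to (?P<diff>\S+) and previous config saved to (?P<backup>\S+)"
)


def parse_sal_dbctl(line):
    """Return the fields of a dbctl SAL entry as a dict, or None if no match."""
    match = SAL_DBCTL.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The quoted timestamps end in "Z", so they are treated as UTC.
    fields["ts"] = datetime.strptime(
        fields["ts"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return fields


entry = parse_sal_dbctl(
    "Mentioned in SAL (#wikimedia-operations) [2021-09-28T13:40:12Z] "
    "<marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2103 T290865', "
    "diff saved to https://phabricator.wikimedia.org/P17337 and previous config "
    "saved to /var/cache/conftool/dbconfig/20210928-134012-marostegui.json"
)
print(entry["user"], entry["message"], entry["diff"])
```

Pairing the parsed depool and repool entries for a host gives the window during which it was out of rotation.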

Change 731872 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db2112: Disable notifications

https://gerrit.wikimedia.org/r/731872

Change 731872 merged by Marostegui:

[operations/puppet@production] db2112: Disable notifications

https://gerrit.wikimedia.org/r/731872

Mentioned in SAL (#wikimedia-operations) [2021-10-19T05:46:56Z] <marostegui> Reimage db2112 (s1 codfw master) T290865

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster

I am having issues with the db2112 reimage (codfw master). I am going to see if I can overcome them; if not, I will do a master switchover, as the candidate master is already reimaged.

Looks like it fails to completely load the Debian installer after loading the initrd, and then after the timeout it reboots itself.
So I think I am going to proceed with the master switchover and then ask Papaul to upgrade the firmware in case we are being hit by T216240 again.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster executed with errors:

  • db2112 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
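Because db2112 went through several reimage attempts, it can help to reduce each cookbook report to its host, overall status, and the last step reached. The following is a minimal sketch that parses the bulleted report text as it appears in this task. The report layout (a `host (STATUS)` bullet followed by step bullets) is assumed from the quoted output, and `parse_report` is a hypothetical helper, not part of the cookbook.

```python
import re

# First-level bullet: "• <host> (<STATUS>)", with STATUS in capitals.
HOST_LINE = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")
# Any other bullet is treated as a completed (or failed) step.
STEP_LINE = re.compile(r"^\s*•\s+(?P<step>.+?)\s*$")


def parse_report(text):
    """Extract host, status and ordered steps from one pasted cookbook report."""
    host = status = None
    steps = []
    for line in text.splitlines():
        match = HOST_LINE.match(line)
        if match and host is None:
            host, status = match["host"], match["status"]
            continue
        match = STEP_LINE.match(line)
        if match and host is not None:
            steps.append(match["step"])
    return {
        "host": host,
        "status": status,
        "steps": steps,
        "last_step": steps[-1] if steps else None,
    }


REPORT = """
  • db2112 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
"""

report = parse_report(REPORT)
print(report["host"], report["status"], "-", report["last_step"])
```

In the failed attempts the last step before the failure line is the IPMI reboot, which fits the later observation that the Debian installer never finished loading.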

Change 731882 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] mariadb: Promote db2103 to s1 codfw master

https://gerrit.wikimedia.org/r/731882

Change 731882 merged by Marostegui:

[operations/puppet@production] mariadb: Promote db2103 to s1 codfw master

https://gerrit.wikimedia.org/r/731882

Change 721285 merged by Marostegui:

[operations/puppet@production] dbbackups: Switch s1 backup generation from db2097 to db2141

https://gerrit.wikimedia.org/r/721285

I have switched over codfw s1 master, the new master is db2103.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster executed with errors:

  • db2112 (FAIL)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster executed with errors:

  • db2112 (FAIL)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS stretch

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS stretch completed:

  • db2112 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh stretch OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110191018_marostegui_22737_db2112.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 731938 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/731938

Change 731938 merged by Marostegui:

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/731938

Upgraded all s1 eqiad replicas (apart from the backup source and the candidate master) to 10.4.21.
Still pending are the clouddb* hosts, which I will do one per day, today and tomorrow (to avoid having two of them fully cold on the same day).
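The idea of spreading the clouddb* upgrades across days, so that no two hosts are cold at once, can be sketched as a tiny scheduling helper. This is only an illustration: `stagger` is a hypothetical function, the host names are the two clouddb hosts depooled in this task, and the start date is an assumed example rather than a recorded plan.

```python
from datetime import date, timedelta


def stagger(hosts, start, per_day=1):
    """Assign hosts to consecutive days, with at most per_day hosts on each day."""
    plan = {}
    for index, host in enumerate(hosts):
        day = start + timedelta(days=index // per_day)
        plan.setdefault(day, []).append(host)
    return plan


# Start date chosen for illustration only.
plan = stagger(["clouddb1017", "clouddb1013"], date(2021, 10, 19))
for day, day_hosts in plan.items():
    print(day.isoformat(), ", ".join(day_hosts))
```

With `per_day=1`, each restarted host gets a full day to warm up before the next one is taken out.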

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db2112.codfw.wmnet with OS buster completed:

  • db2112 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110200526_marostegui_1881_db2112.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db2112 (old master and now candidate master) has been reimaged to Buster. Checking its tables now.

Change 732118 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/732118

Change 732118 merged by Marostegui:

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/732118

Change 732248 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/732248

Mentioned in SAL (#wikimedia-operations) [2021-10-20T06:45:30Z] <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1118 (s1) for reimage T290865', diff saved to https://phabricator.wikimedia.org/P17552 and previous config saved to /var/cache/conftool/dbconfig/20211020-064529-marostegui.json

Change 732248 merged by Marostegui:

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/732248

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1118.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1118.eqiad.wmnet with OS buster completed:

  • db1118 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110200648_marostegui_17828_db1118.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1118 (candidate master on eqiad) reimaged, checking its tables now.

db1118 is fine, restarted replication.

db2112 is fine, restarted replication

clouddb hosts for s1 are fully upgraded

This is now blocked on the switchover: T293964

Mentioned in SAL (#wikimedia-operations) [2021-11-03T06:10:18Z] <marostegui> Stop replication on db1163 T290865

Change 736328 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1163: Disable notifications

https://gerrit.wikimedia.org/r/736328

Change 736328 merged by Marostegui:

[operations/puppet@production] db1163: Disable notifications

https://gerrit.wikimedia.org/r/736328

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1163.eqiad.wmnet with OS buster

Change 721286 merged by Marostegui:

[operations/puppet@production] dbbackups: Switch s1 backup generation from db1139 to db1140

https://gerrit.wikimedia.org/r/721286

Merged the backup change as well. An s1 snapshot was taken successfully last night, so nothing is in progress.

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1163.eqiad.wmnet with OS buster completed:

  • db1163 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202111030613_marostegui_15409_db1163.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1163 has been reimaged to Buster. I am now checking its tables.

db1163 came back clean - started replication.

Marostegui moved this task from In progress to Done on the DBA board.

The only thing left is the backup source cleanup by @jcrespo.

Change 738195 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Reimage db1116, db1139, db2097, db2100 to buster

https://gerrit.wikimedia.org/r/738195

Change 738195 merged by Jcrespo:

[operations/puppet@production] dbbackups: Reimage db1116, db1139, db2097, db2100 to buster

https://gerrit.wikimedia.org/r/738195

@Marostegui backup upgrade/cleanup is done, I will later reorganize backup distribution (but that is out of scope).

Closing! thank you!!