Upgrade x1 to MariaDB 10.6
Closed, Resolved (Public)

Description

Ideally, we should reimage these hosts to bookworm:

  • dbstore1009
  • db2191
  • db2131
  • db2115
  • db2101 - backup source - tracked in T360751
  • db2097
  • db2096 - will be decommissioned in T358741
  • db1237
  • db1225
  • db1220 - new master
  • db1216 - backup source - tracked in T360751
  • db1179 - old master - ready T359790

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui updated the task description.
Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui added a subscriber: jcrespo.

Coordinate with @jcrespo for the backup sources, but we can leave those until more sections have been migrated.

As I prepared beforehand for a previous upgrade, s6, x1 and s2 are already producing 10.6-compatible backups, and their backup sources should be, at least partially, upgraded. We can just drop the 10.4 ones once the sections are fully upgraded and move 10.4 sections instead (I will handle that). So I should not be a blocker for this ticket. If you have a roadmap of future upgrades beyond those sections, I can start preparing for them now, so that I am ready in advance, as in this case.

Thank you for taking me into account!

Mentioned in SAL (#wikimedia-operations) [2024-03-06T14:32:04Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'Depool to reimage T358642', diff saved to https://phabricator.wikimedia.org/P58588 and previous config saved to /var/cache/conftool/dbconfig/20240306-143204-arnaudb.json

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2131.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2131.codfw.wmnet with OS bookworm completed:

  • db2131 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403061454_arnaudb_625899_db2131.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-03-06T15:21:35Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'Depool to clone on db2131 T358642', diff saved to https://phabricator.wikimedia.org/P58589 and previous config saved to /var/cache/conftool/dbconfig/20240306-152130-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-03-07T10:10:05Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'Depool to upgrade T358642', diff saved to https://phabricator.wikimedia.org/P58624 and previous config saved to /var/cache/conftool/dbconfig/20240307-101004-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-03-07T10:11:48Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on db1220.eqiad.wmnet with reason: T358642

Mentioned in SAL (#wikimedia-operations) [2024-03-07T10:12:01Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1220.eqiad.wmnet with reason: T358642
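
The SAL entries above share a regular shape: a UTC timestamp in square brackets, a `<user@host>` tag, and then a free-text message that usually names a task. For anyone who wants to pull these entries into a timeline, here is a minimal sketch in Python. The regular expressions and the `parse_sal_entry` helper are assumptions inferred only from the entries in this task; they are not part of any SAL tooling.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern, inferred only from the SAL entries quoted in this task:
# "[<UTC timestamp>] <user@host> free-text message"
SAL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\]\s+<(?P<user>[^@>]+)@(?P<host>[^>]+)>\s+(?P<message>.*)"
)
# Phabricator task references look like T358642.
TASK_RE = re.compile(r"\bT\d+\b")


def parse_sal_entry(line):
    """Return the parts of one SAL entry as a dict, or None if it does not match."""
    match = SAL_RE.search(line)
    if match is None:
        return None
    timestamp = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "timestamp": timestamp.replace(tzinfo=timezone.utc),
        "user": match["user"],
        "host": match["host"],
        "message": match["message"],
        "tasks": TASK_RE.findall(match["message"]),
    }


if __name__ == "__main__":
    sample = (
        "Mentioned in SAL (#wikimedia-operations) [2024-03-07T10:11:48Z] "
        "<arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 "
        "on db1220.eqiad.wmnet with reason: T358642"
    )
    print(parse_sal_entry(sample))
```

Entries that do not follow this shape, such as the cookbook summaries, return `None` and can be skipped by the caller.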

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1220.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1220.eqiad.wmnet with OS bookworm completed:

  • db1220 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403071028_arnaudb_771962_db1220.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1179.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1179.eqiad.wmnet with OS bookworm completed:

  • db1179 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403111356_arnaudb_483161_db1179.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change 1010252 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[operations/puppet@production] mariadb: candidate master add for x1

https://gerrit.wikimedia.org/r/1010252

Change 1010252 merged by Arnaudb:

[operations/puppet@production] mariadb: candidate master add for x1

https://gerrit.wikimedia.org/r/1010252

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2115.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2115.codfw.wmnet with OS bookworm completed:

  • db2115 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403140738_arnaudb_958365_db2115.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

The remaining servers are backup sources; feel free to let me know if and when I can help!

I have created a task to track the backup sources on their own (T360751: Upgrade backup sources to MariaDB 10.6), so I am going to close this as fixed.
Thanks for working on it!

Marostegui updated the task description.