Upgrade backup source and mediabackup database hosts to Debian Bookworm, or decommission them
Closed, Resolved (Public)

Description

Backup sources

  • db1150
  • db1171
  • db1216
  • db1225
  • db1239
  • db1240
  • db1245
  • db2139 (to be decommissioned)
  • db2141
  • db2197
  • db2198
  • db2199
  • db2200
  • db2201
  • db2239

backup1 hosts

  • db1204
  • db1205
  • db2183
  • db2184
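
The two lists above mix hosts from both datacenters. As a minimal sketch for splitting them, assuming the naming convention suggested by the FQDNs later in this task (db1xxx in eqiad, db2xxx in codfw), the hosts could be grouped as follows. The mapping, `group_by_site` and `fqdn` are illustrative names, not part of any existing tooling.

```python
from collections import defaultdict

# Assumption: the first digit after "db" encodes the datacenter
# (1 -> eqiad, 2 -> codfw), as the FQDNs in the event log suggest.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}

BACKUP_SOURCES = [
    "db1150", "db1171", "db1216", "db1225", "db1239", "db1240", "db1245",
    "db2139", "db2141", "db2197", "db2198", "db2199", "db2200", "db2201",
    "db2239",
]
BACKUP1_HOSTS = ["db1204", "db1205", "db2183", "db2184"]


def group_by_site(hosts):
    """Group short hostnames by the datacenter inferred from their name."""
    grouped = defaultdict(list)
    for host in hosts:
        site = SITE_BY_DIGIT.get(host[2], "unknown")
        grouped[site].append(host)
    return dict(grouped)


def fqdn(host):
    """Build the FQDN form used in the event log, e.g. db2141.codfw.wmnet."""
    return f"{host}.{SITE_BY_DIGIT[host[2]]}.wmnet"
```

For example, `group_by_site(BACKUP1_HOSTS)` separates the eqiad pair from the codfw pair, which matches how the reimages below were batched per site.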

Related Objects

Status | Subtype | Assigned | Task
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | jcrespo |
Resolved | | Marostegui |
Resolved | | Marostegui |
Declined | | ABran-WMF |
Resolved | | ABran-WMF |
Resolved | | ABran-WMF |
Resolved | | Ladsgroup |
Resolved | | Marostegui |
Resolved | | Marostegui |
Resolved | Request | Jclark-ctr |
Resolved | Request | Jclark-ctr |
Resolved | | Marostegui |
Resolved | | Marostegui |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | Request | Jhancock.wm |
Resolved | | ABran-WMF |

Event Timeline

jcrespo changed the task status from Open to In Progress.
jcrespo claimed this task.
jcrespo triaged this task as High priority.
jcrespo edited projects, added database-backups, media-backups; removed Epic.
jcrespo changed the status of subtask T366092: Upgrade eqiad mediabackups database hosts to Debian Bookworm from Open to In Progress.
jcrespo removed a subscriber: Aklapper.

Change #1112184 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] installserver: Review backup and db hosts

https://gerrit.wikimedia.org/r/1112184

Change #1112184 merged by Jcrespo:

[operations/puppet@production] installserver: Review backup and db hosts

https://gerrit.wikimedia.org/r/1112184

Icinga downtime and Alertmanager silence (ID=2ec27167-237c-4fd7-9ccb-4486e0a3234c) set by jynus@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: reimage

db2141.codfw.wmnet
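
The downtime durations in these messages ("1 day, 0:00:00", "2:00:00", "3 days, 0:00:00") have the same shape as the string form of a Python `datetime.timedelta`. A minimal sketch for turning them back into `timedelta` values, for example to compute when a silence expires, could look like this. `parse_duration` is a hypothetical helper, and it assumes the messages always use this format.

```python
import re
from datetime import timedelta

# Matches "H:MM:SS" with an optional "N day(s), " prefix.
_DURATION_RE = re.compile(
    r"^(?:(?P<days>\d+) days?, )?(?P<h>\d+):(?P<m>\d{2}):(?P<s>\d{2})$"
)


def parse_duration(text):
    """Parse a duration such as '1 day, 0:00:00' or '2:00:00'."""
    match = _DURATION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return timedelta(
        days=int(match["days"] or 0),  # the days prefix is optional
        hours=int(match["h"]),
        minutes=int(match["m"]),
        seconds=int(match["s"]),
    )
```

Adding the result to the time the silence was set gives its expiry time.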

Cookbook cookbooks.sre.hosts.reimage was started by jynus@cumin2002 for host db2141.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jynus@cumin2002 for host db2141.codfw.wmnet with OS bookworm completed:

  • db2141 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501171201_jynus_262088_db2141.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga status is not optimal, downtime not removed

That is because I am rebuilding the tables, so replication is stopped.
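
Each cookbook report above records where the first Puppet run was logged, for example `202501171201_jynus_262088_db2141.out`. As a rough sketch, the file name can be split into its parts to find the log for a given host and run. The field meanings are an assumption based only on the names seen in this task: a `YYYYMMDDHHMM` timestamp, the invoking user, an opaque numeric run identifier (possibly a process ID), and the short hostname. `ReimageLog` and `parse_reimage_log` are illustrative names.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # timezone is not stated in the file name
    user: str
    run_id: str        # assumption: opaque identifier, possibly a PID
    host: str


def parse_reimage_log(path):
    """Split a reimage log path into timestamp, user, run id and host."""
    stem = PurePosixPath(path).stem  # drop the directory and ".out"
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, run_id, host)
```

Calling it on the path from the db2141 report yields the host `db2141`, the user `jynus` and a start time of 2025-01-17 12:01.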

Icinga downtime and Alertmanager silence (ID=d7245c24-f67b-4f43-ae17-b2ef80f610ed) set by jynus@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: os upgrade

db1245.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1245.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1245.eqiad.wmnet with OS bookworm completed:

  • db1245 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501201054_root_52321_db1245.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Change #1112802 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Remove set user permissions from m1 backup user grants

https://gerrit.wikimedia.org/r/1112802

Icinga downtime and Alertmanager silence (ID=f63b7dc3-cd57-40da-aafb-d98d09fe8ad8) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: os upgrade

db1240.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1240.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1240.eqiad.wmnet with OS bookworm completed:

  • db1240 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501220940_root_844917_db1240.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=8324c4ec-57d8-4460-ab4e-364dd23824da) set by jynus@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: os upgrade

db1205.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1205.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1205.eqiad.wmnet with OS bookworm completed:

  • db1205 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501221225_root_909257_db1205.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2025-01-23T10:57:52Z] <jynus> pausing media backups on eqiad for maintenance T383902

Icinga downtime and Alertmanager silence (ID=8968cdca-1368-4a8b-8d7b-88380f4a6dfe) set by jynus@cumin1002 for 4:00:00 on 1 host(s) and their services with reason: os upgrade

db1204.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=8e426345-55db-4d8d-97b1-479f633bd115) set by jynus@cumin1002 for 4:00:00 on 1 host(s) and their services with reason: os upgrade

db1205.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1204.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1204.eqiad.wmnet with OS bookworm completed:

  • db1204 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501231126_root_1355894_db1204.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=6f470676-dea0-4f69-80aa-826d9c313d20) set by jynus@cumin1002 for 2:00:00 on 1 host(s) and their services with reason: reimage

db1239.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1239.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1239.eqiad.wmnet with OS bookworm completed:

  • db1239 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501231410_root_1505629_db1239.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=9b068a8d-a257-44ae-b3a9-aa92c844556a) set by jynus@cumin1002 for 6:00:00 on 1 host(s) and their services with reason: os upgrade

db1225.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1225.eqiad.wmnet with OS bookworm

Icinga downtime and Alertmanager silence (ID=fef62d94-be78-4342-b8f7-11aec550c58d) set by jynus@cumin1002 for 4:00:00 on 1 host(s) and their services with reason: os upgrade

db1216.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1216.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1225.eqiad.wmnet with OS bookworm completed:

  • db1225 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501240825_root_1884836_db1225.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1216.eqiad.wmnet with OS bookworm completed:

  • db1216 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501240851_root_1888873_db1216.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=850d9949-25a8-4e40-abfc-8a9183ee0279) set by jynus@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their services with reason: rebuilding tables

db1216.eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=da5694f7-58b7-4217-8b3e-f5bde76d7490) set by jynus@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their services with reason: reimage

db1171.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1171.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1171.eqiad.wmnet with OS bookworm executed with errors:

  • db1171 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console db1171.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1171.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1171.eqiad.wmnet with OS bookworm completed:

  • db1171 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501271216_root_2535731_db1171.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
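
db1171 shows two reports, a failed attempt followed by a successful retry, so a summary of this timeline should only count the last result per host. A minimal sketch, assuming the report header always has the form "• host (STATUS)" as in the reports above, might be the following. `final_status` is a hypothetical helper, not part of the cookbook.

```python
import re

# Header lines look like "  • db1171 (FAIL)"; step lines never end in "(UPPERCASE)".
_HEADER_RE = re.compile(r"^\s*•\s+(?P<host>\S+)\s+\((?P<status>[A-Z]+)\)\s*$")


def final_status(report_lines):
    """Return the most recent status reported for each host."""
    latest = {}
    for line in report_lines:
        match = _HEADER_RE.match(line)
        if match:
            # A later report for the same host overrides the earlier one.
            latest[match["host"]] = match["status"]
    return latest
```

Fed the two db1171 reports in order, this returns `WARN` for the host rather than `FAIL`.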

Icinga downtime and Alertmanager silence (ID=cd3eaa16-18c2-4399-afd5-a1186d100dc4) set by jynus@cumin1002 for 3 days, 0:00:00 on 1 host(s) and their services with reason: reimage

db1150.eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by root@cumin1002 for host db1150.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by root@cumin1002 for host db1150.eqiad.wmnet with OS bookworm completed:

  • db1150 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202501280926_root_2794567_db1150.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Icinga downtime and Alertmanager silence (ID=ac63cc5e-30f2-4726-b368-8ac5c3ba5641) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: test new s4 backups

db2201.codfw.wmnet

Icinga downtime and Alertmanager silence (ID=0f2ee58b-8e61-458b-8036-54ff80c6dc78) set by jynus@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: prepare for decom

db2202.codfw.wmnet

Change #1115774 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Decommission db2139

https://gerrit.wikimedia.org/r/1115774

Change #1115774 merged by Jcrespo:

[operations/puppet@production] dbbackups: Decommission db2139

https://gerrit.wikimedia.org/r/1115774

This is technically done (all hosts have been upgraded), but I still need to review the grants and, ideally, prune and unify them across all hosts.

Change #1112802 merged by Jcrespo:

[operations/puppet@production] dbbackups: Fix dump grants for backup sources and m1

https://gerrit.wikimedia.org/r/1112802

Change #1116845 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Update grants for misc hosts other than m1

https://gerrit.wikimedia.org/r/1116845

Change #1116846 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Remove last references to dbprov[12]00[12]

https://gerrit.wikimedia.org/r/1116846

Mentioned in SAL (#wikimedia-operations) [2025-02-04T11:48:11Z] <jynus> deploying new backup grants for matomo and analytics_meta T383902

Mentioned in SAL (#wikimedia-operations) [2025-02-04T12:38:18Z] <jynus> deploying new backup grants for ES hosts T383902

Change #1117182 had a related patch set uploaded (by Jcrespo; author: Jcrespo):

[operations/puppet@production] dbbackups: Fix m5 backup grant issues

https://gerrit.wikimedia.org/r/1117182

Change #1116845 merged by Jcrespo:

[operations/puppet@production] dbbackups: Update grants for misc hosts other than m1

https://gerrit.wikimedia.org/r/1116845

Change #1117182 merged by Jcrespo:

[operations/puppet@production] dbbackups: Fix m5 backup grant issues

https://gerrit.wikimedia.org/r/1117182

This is now done: all backup user grants have been reviewed and updated.

Change #1116846 merged by Jcrespo:

[operations/puppet@production] dbbackups: Remove last references to dbprov[12]00[12]

https://gerrit.wikimedia.org/r/1116846
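
The patch title above uses the shorthand `dbprov[12]00[12]`. Reading the brackets as single-character alternatives (an assumption about the notation, not something the patch itself states), the shorthand covers four hostnames. A small sketch that expands it, with `expand_brackets` as an illustrative helper that handles only plain character lists and not ranges such as `[1-3]`:

```python
import re
from itertools import product


def expand_brackets(pattern):
    """Expand single-character classes such as [12] into all combinations."""
    # re.split with a capturing group alternates literal text (even indices)
    # and bracket contents (odd indices).
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [
        list(part) if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in product(*choices)]
```

For `dbprov[12]00[12]` this yields dbprov1001, dbprov1002, dbprov2001 and dbprov2002.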