Upgrade s4 to MariaDB 10.6
Open, Medium, Public

Description

  • dbstore1007
  • db2219
  • db2210
  • db2206
  • db2199
  • db2187
  • db2179 master T363688
  • db2172
  • db2155 sanitarium master
  • db2147
  • db2140
  • db2139
  • db2137
  • db2136
  • db2119 host does not exist anymore
  • db2110 host does not exist anymore
  • db2106 host does not exist anymore
  • db2099 dbstore host @jcrespo will handle it
  • db1249
  • db1248
  • db1247
  • db1245 dbstore host @jcrespo will handle it
  • db1244
  • db1243
  • db1242
  • db1241
  • db1238 master T363689
  • db1221 sanitarium master
  • db1199
  • db1190
  • db1160
  • db1155
  • db1150 dbstore host @jcrespo will handle it
  • db1125 (to be ignored; will be decommissioned)
  • clouddb1021
  • clouddb1019
  • clouddb1015

Event Timeline

Marostegui triaged this task as Medium priority.
Marostegui moved this task from Triage to Ready on the DBA board.

Mentioned in SAL (#wikimedia-operations) [2024-04-23T08:26:44Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db2206.codfw.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-23T08:26:57Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2206.codfw.wmnet with reason: T362746
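
For anyone post-processing this timeline, a minimal Python sketch for pulling the fields out of the SAL downtime entries shown here. The function name `parse_sal` and the returned dictionary keys are hypothetical; the regular expression is inferred only from the entry format visible in this task, not from any SAL or Spicerack specification.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the entries in this task, e.g.
# "[2024-04-23T08:26:44Z] <user@host> START - Cookbook sre.hosts.downtime
#  for 2:00:00 on db2206.codfw.wmnet with reason: T362746"
SAL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] <(?P<who>[^>]+)> "
    r"(?P<phase>START|END \((?P<result>\w+)\)) - Cookbook (?P<cookbook>\S+) "
    r"(?:\(exit_code=(?P<exit>\d+)\) )?"
    r"for (?P<duration>\d+:\d{2}:\d{2}) on (?P<hosts>\S+) "
    r"with reason: (?P<reason>\S+)"
)


def parse_sal(line):
    """Return a dict describing one SAL downtime entry, or None if it does not match."""
    m = SAL_RE.search(line)
    if m is None:
        return None
    hours, minutes, seconds = (int(x) for x in m["duration"].split(":"))
    return {
        "timestamp": datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ"),
        "user": m["who"],
        "phase": "START" if m["phase"] == "START" else "END",
        "result": m["result"],  # None for START entries
        "cookbook": m["cookbook"],
        "exit_code": int(m["exit"]) if m["exit"] is not None else None,
        "duration": timedelta(hours=hours, minutes=minutes, seconds=seconds),
        "hosts": m["hosts"],
        "reason": m["reason"],
    }
```

Adding `duration` to the `timestamp` of a START entry gives the time at which the downtime would expire, which is handy for spotting reimages that ran past their window.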

Mentioned in SAL (#wikimedia-operations) [2024-04-23T08:42:26Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db2172.codfw.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-23T08:42:39Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2172.codfw.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2172.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2172.codfw.wmnet with OS bookworm completed:

  • db2172 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404230902_arnaudb_83290_db2172.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
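
The reimage reports in this timeline all follow the same shape (a header line, a host line with a status in parentheses, then the step list), so they can be summarised mechanically. The following is a rough sketch only: `summarize_reimages` and its output keys are hypothetical names, and the patterns are inferred from the text of this task rather than from the cookbook's source.

```python
import re

# Header line as it appears in this task's timeline.
HEADER_RE = re.compile(
    r"Cookbook cookbooks\.sre\.hosts\.reimage started by (?P<user>\S+) "
    r"for host (?P<host>\S+) with OS (?P<os>\S+) "
    r"(?P<outcome>completed|executed with errors):"
)
# Host line such as "  • db2172 (WARN)".
STATUS_RE = re.compile(r"^\s*•\s+(?P<name>\S+) \((?P<status>[A-Z]+)\)\s*$")


def summarize_reimages(lines):
    """Collect host, OS, outcome, status and steps for each reimage report."""
    results = []
    current = None
    for line in lines:
        header = HEADER_RE.search(line)
        if header:
            current = {
                "host": header["host"],
                "os": header["os"],
                "outcome": header["outcome"],
                "status": None,
                "steps": [],
            }
            results.append(current)
            continue
        if current is None:
            continue
        status = STATUS_RE.match(line)
        if status and current["status"] is None:
            current["status"] = status["status"]
        elif line.strip().startswith("•"):
            # A step bullet: keep its text without the bullet character.
            current["steps"].append(line.strip().lstrip("•").strip())
        elif line.strip():
            # Any other non-empty line ends the current report.
            current = None
    return results
```

Filtering the result for entries whose `status` is `FAIL` gives a quick list of hosts that need a retry.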

Mentioned in SAL (#wikimedia-operations) [2024-04-23T12:57:31Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db2147.codfw.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-23T12:57:44Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2147.codfw.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2147.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2147.codfw.wmnet with OS bookworm completed:

  • db2147 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231317_arnaudb_238858_db2147.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-23T13:41:34Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db2140.codfw.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-23T13:41:51Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2140.codfw.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2140.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2140.codfw.wmnet with OS bookworm completed:

  • db2140 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231404_arnaudb_278094_db2140.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-23T14:26:06Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db2136.codfw.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-23T14:26:25Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2136.codfw.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2136.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2136.codfw.wmnet with OS bookworm completed:

  • db2136 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231449_arnaudb_326219_db2136.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2136.codfw.wmnet with OS bookworm executed with errors:

  • db2136 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404231449_arnaudb_326219_db2136.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" db2136.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Mentioned in SAL (#wikimedia-operations) [2024-04-24T08:36:41Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1248.eqiad.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-24T08:36:58Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1248.eqiad.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1248.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1248.eqiad.wmnet with OS bookworm completed:

  • db1248 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404240855_arnaudb_461535_db1248.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-24T09:24:39Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1247.eqiad.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-24T09:24:52Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1247.eqiad.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1247.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1247.eqiad.wmnet with OS bookworm completed:

  • db1247 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404240944_arnaudb_468761_db1247.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-24T12:24:46Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1242.eqiad.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-24T12:24:59Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1242.eqiad.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1242.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1242.eqiad.wmnet with OS bookworm completed:

  • db1242 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404241242_arnaudb_492048_db1242.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-24T13:17:22Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1199.eqiad.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-24T13:17:42Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1199.eqiad.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1199.eqiad.wmnet with OS bookworm

Yeah, no issue with that. The reimage cookbook will parse the configured Puppet version and automatically pick 5 or 7.

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1199.eqiad.wmnet with OS bookworm completed:

  • db1199 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404241334_arnaudb_500972_db1199.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-24T14:13:19Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1190.eqiad.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-24T14:13:33Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1190.eqiad.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1190.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1190.eqiad.wmnet with OS bookworm completed:

  • db1190 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404241432_arnaudb_510726_db1190.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-25T07:44:19Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1241.eqiad.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-25T07:44:35Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1241.eqiad.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1241.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1241.eqiad.wmnet with OS bookworm completed:

  • db1241 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404250804_arnaudb_640700_db1241.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-25T08:40:13Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1160.eqiad.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-25T08:40:31Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1160.eqiad.wmnet with reason: T362746

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1160.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1160.eqiad.wmnet with OS bookworm completed:

  • db1160 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404250857_arnaudb_649389_db1160.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-25T12:03:10Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on db[2155,2187].codfw.wmnet with reason: T362746

Mentioned in SAL (#wikimedia-operations) [2024-04-25T12:03:43Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db[2155,2187].codfw.wmnet with reason: T362746
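
The downtime above targets two hosts with a single bracketed expression, `db[2155,2187].codfw.wmnet`. As an illustration of how that expands, here is a simplified Python sketch. It handles only one bracket group containing a comma-separated list, which is all that appears in this task; it is not the implementation used by Cumin, and `expand_hosts` is a hypothetical helper name.

```python
import re

# One optional prefix, one "[a,b,...]" group, one optional suffix.
BRACKET_RE = re.compile(
    r"(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)"
)


def expand_hosts(expr):
    """Expand 'db[2155,2187].codfw.wmnet' into a list of full hostnames.

    Expressions without brackets are returned unchanged as a one-item list.
    Ranges such as '[1-3]' are deliberately not handled in this sketch.
    """
    m = BRACKET_RE.fullmatch(expr)
    if m is None:
        return [expr]
    return [
        f"{m['prefix']}{part.strip()}{m['suffix']}"
        for part in m["body"].split(",")
    ]
```

For example, `expand_hosts("db[2155,2187].codfw.wmnet")` yields the two hosts named in the SAL entry, db2155 and db2187, each with the `.codfw.wmnet` suffix.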

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bookworm executed with errors:

  • db2155 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" db2155.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bookworm executed with errors:

  • db2155 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" db2155.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bullseye executed with errors:

  • db2155 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" db2155.codfw.wmnet to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bookworm

The NIC in this host is running firmware version 22.0.7.60, which is known to cause issues (specifically, the link doesn't come up once the Debian installer environment has loaded following PXE boot).

The firmware for this NIC should be downgraded to 21.85.21.92 and reimage tried again.

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2155.codfw.wmnet with OS bookworm completed:

  • db2155 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404251507_arnaudb_971572_db2155.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

good catch! at this point I think it's not a concern anymore. Unless it has further implications?

No, the fact it worked means we didn't hit the issue we have seen before with this, so no further action needed.

I note the 10G NIC model in this box is a BCM57412. I think perhaps the issue we saw before only applies to the BCM57810; I may look into that further and discuss it with DC-Ops, but it is not relevant to this task. Thanks!

To clarify: the issue we have observed with firmware 22.x is with the BCM57412. However, it affects bullseye installations; this one is bookworm, hence we were OK.
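
The condition described in the comments above (BCM57412 NIC, firmware 22.x, bullseye target) can be written down as a small pre-flight check before starting a reimage. This is only a sketch of that reasoning: the `HostNic` record and `may_hit_pxe_link_issue` function are hypothetical names, the inputs would have to come from wherever NIC inventory is kept, and the rule itself encodes nothing beyond what the comments state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostNic:
    """Hypothetical record of the facts needed for the check."""
    hostname: str
    nic_model: str   # e.g. "BCM57412"
    firmware: str    # e.g. "22.0.7.60"
    target_os: str   # e.g. "bullseye" or "bookworm"


def may_hit_pxe_link_issue(host):
    """True when the host matches the combination reported in this task:
    BCM57412 NIC, firmware major version 22, installing bullseye."""
    firmware_major = host.firmware.split(".", 1)[0]
    return (
        host.nic_model == "BCM57412"
        and firmware_major == "22"
        and host.target_os == "bullseye"
    )
```

With db2155's values (BCM57412, 22.0.7.60) the check is false for a bookworm install and true for a bullseye one; with the suggested downgrade target 21.85.21.92 it is false for either OS.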

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2179.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2179.codfw.wmnet with OS bookworm completed:

  • db2179 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404291307_arnaudb_3047981_db2179.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB