Upgrade s5 to MariaDB 10.6
Closed, Resolved (Public)

Description

Cf. this comment in T356960.

$ for i in $(sudo cumin --force --success-percentage 1 --no-progress --no-color 'db1183* or db1213* or db1200* or db1230* or db1161* or db1210* or db1185* or db2113* or db2123* or db2128* or db2211* or db2178* or db2157* or db2192* or db2171*' "grep 11.9 /etc/debian_version"|grep -i wmnet| nodeset -e -S '\n'); do echo [] $i ; done
15 hosts will be targeted:
db[2113,2123,2128,2157,2171,2178,2192,2211].codfw.wmnet,db[1161,1183,1185,1200,1210,1213,1230].eqiad.wmnet
FORCE mode enabled, continuing without confirmation
33.3% (5/15) of nodes failed to execute command 'grep 11.9 /etc/debian_version': db[2171,2192,2211].codfw.wmnet,db[1210,1213].eqiad.wmnet
66.7% (10/15) success ratio (>= 1.0% threshold) for command: 'grep 11.9 /etc/debian_version'.: db[2113,2123,2128,2157,2178].codfw.wmnet,db[1161,1183,1185,1200,1230].eqiad.wmnet
66.7% (10/15) success ratio (>= 1.0% threshold) of nodes successfully executed all commands.: db[2113,2123,2128,2157,2178].codfw.wmnet,db[1161,1183,1185,1200,1230].eqiad.wmnet
[] (10)
  • db2178.codfw.wmnet
  • db1200.eqiad.wmnet
  • db2157.codfw.wmnet
  • db1185.eqiad.wmnet
  • db1230.eqiad.wmnet
  • db1183.eqiad.wmnet - eqiad master T362668
  • db2123.codfw.wmnet - codfw candidate master
  • db2113.codfw.wmnet - codfw master
  • db2128.codfw.wmnet - sanitarium master
  • db2111.codfw.wmnet (was missing from the listing)
  • db1161.eqiad.wmnet - sanitarium master
  • db1245.eqiad.wmnet
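
The one-liner above does host selection, the Debian version check, and checklist formatting in a single pipeline. A minimal sketch of the last two steps, runnable without cumin, might look like the following. Both function names are hypothetical helpers, not part of cumin or nodeset. The version check uses a fixed-string, whole-line match because a plain `grep 11.9` treats the dot as a wildcard.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given file contains exactly "11.9"
# on a line. -F disables regex matching, -x requires a whole-line match.
is_debian_11_9() {
    grep -qxF '11.9' "$1"
}

# Hypothetical helper: format hostnames read from stdin, one per line,
# as "[] host" checklist entries.
to_checklist() {
    while IFS= read -r host; do
        # Skip blank lines so stray newlines do not produce empty entries.
        if [ -n "$host" ]; then
            printf '[] %s\n' "$host"
        fi
    done
}

# Example input: two of the hosts from the listing above.
printf '%s\n' db2178.codfw.wmnet db1200.eqiad.wmnet | to_checklist
```

On a real host the check would be called as `is_debian_11_9 /etc/debian_version`.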

Event Timeline

ABran-WMF changed the task status from Open to Stalled. Mar 14 2024, 1:56 PM
ABran-WMF added a subscriber: Marostegui.

blocked by T357547

ABran-WMF triaged this task as Medium priority. Mar 18 2024, 8:01 AM

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2178.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2178.codfw.wmnet with OS bookworm completed:

  • db2178 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403281001_arnaudb_3544636_db2178.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1200.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1200.eqiad.wmnet with OS bookworm completed:

  • db1200 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403281321_arnaudb_3572026_db1200.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-03-28T14:18:46Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'Depool to reimage db2157 (T360116)', diff saved to https://phabricator.wikimedia.org/P58982 and previous config saved to /var/cache/conftool/dbconfig/20240328-141844-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-03-28T14:19:30Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db2157.codfw.wmnet with reason: T360116

Mentioned in SAL (#wikimedia-operations) [2024-03-28T14:19:44Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2157.codfw.wmnet with reason: T360116

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2157.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2157.codfw.wmnet with OS bookworm completed:

  • db2157 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202403281440_arnaudb_3580042_db2157.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)
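
The SAL entries for db2157 above follow a depool, downtime, reimage order. A dry-run sketch of that order is shown below. It only prints the commands. `plan_reimage` is a hypothetical helper, and the exact `dbctl` and `cookbook` flags are assumptions that should be checked against each tool's `--help` before use.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the depool -> downtime -> reimage
# sequence for one replica. plan_reimage is a hypothetical helper; the
# dbctl and cookbook flags below are assumptions to verify with --help.
plan_reimage() {
    host=$1   # short hostname, e.g. db2157
    fqdn=$2   # fully qualified name, e.g. db2157.codfw.wmnet
    task=$3   # Phabricator task ID, e.g. T360116

    echo "sudo dbctl instance $host depool"
    echo "sudo dbctl config commit -m 'Depool to reimage $host ($task)'"
    echo "sudo cookbook sre.hosts.downtime --hours 2 -r '$task' '$fqdn'"
    echo "sudo cookbook sre.hosts.reimage --os bookworm -t $task $host"
}

plan_reimage db2157 db2157.codfw.wmnet T360116
```

Masters and sanitarium masters in the list need extra steps (for example a switchover or downtiming dependent hosts, as in the later entries) that this sketch does not cover.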

Marostegui changed the task status from Stalled to Open. Mon, Apr 1, 5:36 AM
Marostegui moved this task from Blocked to Ready on the DBA board.

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1185.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1185.eqiad.wmnet with OS bookworm completed:

  • db1185 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404020932_arnaudb_230834_db1185.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1230.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1230.eqiad.wmnet with OS bookworm completed:

  • db1230 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404021501_arnaudb_285562_db1230.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-09T07:34:06Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2113 depool for reimage T360116', diff saved to https://phabricator.wikimedia.org/P59999 and previous config saved to /var/cache/conftool/dbconfig/20240409-073406-arnaudb.json

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2113.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2113.codfw.wmnet with OS bookworm completed:

  • db2113 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404090756_arnaudb_1521383_db2113.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-15T11:57:08Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db2128 depool T360116', diff saved to https://phabricator.wikimedia.org/P60498 and previous config saved to /var/cache/conftool/dbconfig/20240415-115708-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-04-15T11:58:14Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2128,2186].codfw.wmnet with reason: upgrade db2128 T360116

Mentioned in SAL (#wikimedia-operations) [2024-04-15T11:58:28Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2128,2186].codfw.wmnet with reason: upgrade db2128 T360116

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2128.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2128.codfw.wmnet with OS bookworm completed:

  • db2128 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404151232_arnaudb_2748166_db2128.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2111.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2111.codfw.wmnet with OS bookworm completed:

  • db2111 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404151406_arnaudb_2772344_db2111.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-16T07:35:22Z] <arnaudb@cumin1002> dbctl commit (dc=all): 'db1161 depool T360116', diff saved to https://phabricator.wikimedia.org/P60571 and previous config saved to /var/cache/conftool/dbconfig/20240416-073521-arnaudb.json

Mentioned in SAL (#wikimedia-operations) [2024-04-16T07:38:17Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on db1161.eqiad.wmnet with reason: T360116

Mentioned in SAL (#wikimedia-operations) [2024-04-16T07:38:30Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on db1161.eqiad.wmnet with reason: T360116

Mentioned in SAL (#wikimedia-operations) [2024-04-16T07:38:58Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 3:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: T360116

Mentioned in SAL (#wikimedia-operations) [2024-04-16T07:39:14Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db[1154,1161].eqiad.wmnet with reason: T360116

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1161.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1161.eqiad.wmnet with OS bookworm completed:

  • db1161 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404160756_arnaudb_2908886_db1161.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2024-04-16T13:36:02Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: T360116

Mentioned in SAL (#wikimedia-operations) [2024-04-16T13:36:15Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: T360116

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db2123.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db2123.codfw.wmnet with OS bookworm completed:

  • db2123 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404161358_arnaudb_2964042_db2123.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Mentioned in SAL (#wikimedia-operations) [2024-04-18T05:57:28Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: upgrade db1183 T360116

Mentioned in SAL (#wikimedia-operations) [2024-04-18T05:57:32Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: upgrade db1183 T360116

Cookbook cookbooks.sre.hosts.reimage was started by arnaudb@cumin1002 for host db1183.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by arnaudb@cumin1002 for host db1183.eqiad.wmnet with OS bookworm completed:

  • db1183 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202404180615_arnaudb_3346075_db1183.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
ABran-WMF updated the task description.

db1245.eqiad.wmnet was missing from the initial inventory.

Mentioned in SAL (#wikimedia-operations) [2024-04-23T10:25:50Z] <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on db1245.eqiad.wmnet with reason: T360116

Mentioned in SAL (#wikimedia-operations) [2024-04-23T10:26:03Z] <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1245.eqiad.wmnet with reason: T360116

db1245 is a backup source; it will be discarded in a while.