Upgrade s1 to Bullseye
Closed, ResolvedPublic

Description

Hosts handled elsewhere:

Special hosts:

  • db1118: old primary (T301312)
  • db1133: test host
  • db1106: sanitarium master
  • db1154: sanitarium
  • db2103: primary
  • db2072: sanitarium master
  • db2094: sanitarium
  • db2102: test host

Generated on cumin1001 with:

sudo python3 ~kormat/bin/cumin-report.py \
  "A:db-section-s1 or P{db1133.eqiad.wmnet or db2102.codfw.wmnet}" \
  "[ \$(lsb_release -sc) = 'bullseye' ] && echo 'OK' || { echo 'NOT OK'; exit 1; }"
  • clouddb1013.eqiad.wmnet
  • clouddb1017.eqiad.wmnet
  • clouddb1021.eqiad.wmnet
  • db1099.eqiad.wmnet
  • db1105.eqiad.wmnet
  • db1106.eqiad.wmnet
  • db1118.eqiad.wmnet
  • db1119.eqiad.wmnet
  • db1132.eqiad.wmnet
  • db1133.eqiad.wmnet
  • db1134.eqiad.wmnet
  • db1135.eqiad.wmnet
  • db1139.eqiad.wmnet
  • db1140.eqiad.wmnet
  • db1154.eqiad.wmnet
  • db1163.eqiad.wmnet
  • db1164.eqiad.wmnet
  • db1169.eqiad.wmnet
  • db1184.eqiad.wmnet
  • db2071.codfw.wmnet
  • db2072.codfw.wmnet
  • db2085.codfw.wmnet
  • db2088.codfw.wmnet
  • db2092.codfw.wmnet
  • db2094.codfw.wmnet
  • db2097.codfw.wmnet
  • db2102.codfw.wmnet
  • db2103.codfw.wmnet
  • db2112.codfw.wmnet
  • db2116.codfw.wmnet
  • db2130.codfw.wmnet
  • db2141.codfw.wmnet
  • db2145.codfw.wmnet
  • db2146.codfw.wmnet
  • dbstore1003.eqiad.wmnet
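The per-host test in the command above compares the output of `lsb_release -sc` with `bullseye`, prints `OK` or `NOT OK`, and exits non-zero on a mismatch. A minimal Python sketch of that same decision follows. The function names and the sample host names are hypothetical and are not part of `cumin-report.py`, whose internals are not shown in this task.

```python
from typing import Dict, List, Tuple


def check_codename(codename: str, expected: str = "bullseye") -> Tuple[str, int]:
    """Return (status text, exit code), mirroring the shell one-liner."""
    # Shell command substitution drops trailing newlines, so strip here too.
    if codename.strip() == expected:
        return ("OK", 0)
    return ("NOT OK", 1)


def split_by_status(codenames: Dict[str, str]) -> Dict[str, List[str]]:
    """Group host names by check result, sorted by host name."""
    groups: Dict[str, List[str]] = {"OK": [], "NOT OK": []}
    for host in sorted(codenames):
        status, _exit_code = check_codename(codenames[host])
        groups[status].append(host)
    return groups


# Hypothetical sample input: host name -> reported release codename.
sample = {"host-b.example": "buster", "host-a.example": "bullseye"}
report = split_by_status(sample)
```

Returning the exit code alongside the text keeps the sketch close to the shell form, where a non-zero exit is what marks a host as failed.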

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-04-29T08:58:58Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27006 and previous config saved to /var/cache/conftool/dbconfig/20220429-085858-kormat.json

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:14:02Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27007 and previous config saved to /var/cache/conftool/dbconfig/20220429-091401-kormat.json

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:36:07Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:30:00 on db1164.eqiad.wmnet with reason: Rebooting for T303171

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:36:12Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1164.eqiad.wmnet with reason: Rebooting for T303171

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:36:18Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1164 depooling: Rebooting for T303171', diff saved to https://phabricator.wikimedia.org/P27008 and previous config saved to /var/cache/conftool/dbconfig/20220429-093613-kormat.json
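The `dbctl commit` entries in this timeline share one shape: a UTC timestamp, `user@host`, the datacenter scope, a quoted commit message, and a paste URL holding the diff. The sketch below shows one way such a line could be split into fields for later analysis. The regular expression and the `parse_dbctl_entry` helper are assumptions derived only from the lines visible in this task, not an official SAL format definition.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Pattern inferred from the entries in this task; other SAL lines may differ.
SAL_DBCTL = re.compile(
    r"\[(?P<ts>[0-9T:\-]+Z)\] "
    r"<(?P<user>[^@>]+)@(?P<host>[^>]+)> "
    r"dbctl commit \(dc=(?P<dc>[^)]+)\): "
    r"'(?P<message>[^']*)', "
    r"diff saved to (?P<diff_url>\S+)"
)


def parse_dbctl_entry(line: str) -> Optional[dict]:
    """Return the fields of a dbctl SAL entry, or None for other lines."""
    match = SAL_DBCTL.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The trailing 'Z' marks UTC; attach the timezone explicitly.
    fields["ts"] = datetime.strptime(
        fields["ts"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return fields


depool_line = (
    "Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:36:18Z] "
    "<kormat@cumin1001> dbctl commit (dc=all): "
    "'db1164 depooling: Rebooting for T303171', diff saved to "
    "https://phabricator.wikimedia.org/P27008 and previous config saved to "
    "/var/cache/conftool/dbconfig/20220429-093613-kormat.json"
)
entry = parse_dbctl_entry(depool_line)
```

Lines that are not dbctl commits, such as the cookbook START and END entries, yield `None` and can be skipped by the caller.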

Cookbook cookbooks.sre.hosts.reimage was started by kormat@cumin1001 for host db1164.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kormat@cumin1001 for host db1164.eqiad.wmnet with OS bullseye executed with errors:

  • db1164 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details
Kormat changed the task status from Open to Stalled. (May 2 2022, 7:24 AM)

Marostegui changed the task status from Stalled to Open. (May 12 2022, 1:33 PM)

T307198 was fixed; I will reimage db1164 so it doesn't fall behind for many more days.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1164.eqiad.wmnet with OS bullseye

Change 791377 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1164: Disable notifications

https://gerrit.wikimedia.org/r/791377

Change 791377 merged by Marostegui:

[operations/puppet@production] db1164: Disable notifications

https://gerrit.wikimedia.org/r/791377

I have reimaged db1164 and started mysql again; it is 13 days behind the master :-/
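The 13-day figure is consistent with the timestamps recorded in this task. The sketch below is a rough cross-check that assumes replication effectively stopped when db1164 was depooled (2022-04-29T09:36:18Z) and resumed around the start of the second reimage run (2022-05-12 13:34, taken from the cookbook log filename and assumed to be UTC). Neither assumption is confirmed by the task, and the `days_behind` helper is hypothetical.

```python
from datetime import datetime, timezone

# Depool commit time from the SAL entry for db1164.
depooled_at = datetime(2022, 4, 29, 9, 36, 18, tzinfo=timezone.utc)
# Start of the successful reimage run, read from the log filename
# 202205121334_...; treating it as UTC is an assumption.
reimage_started_at = datetime(2022, 5, 12, 13, 34, tzinfo=timezone.utc)


def days_behind(stopped: datetime, resumed: datetime) -> int:
    """Whole days elapsed between two points in time."""
    return (resumed - stopped).days


lag_days = days_behind(depooled_at, reimage_started_at)
```

With these inputs `lag_days` is 13, matching the comment above.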

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1164.eqiad.wmnet with OS bullseye completed:

  • db1164 (WARN)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121334_marostegui_1255335_db1164.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

db1164 has finally caught up; I will start repooling it next week.

Change 796507 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/796507

Change 796507 merged by Marostegui:

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/796507

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1118.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1118.eqiad.wmnet with OS bullseye completed:

  • db1118 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205230506_marostegui_978453_db1118.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
Ladsgroup moved this task from Ready to In progress on the DBA board.
Ladsgroup added subscribers: Kormat, Ladsgroup.

Kormat is out sick. I'll quickly finish this.

Mentioned in SAL (#wikimedia-operations) [2022-05-23T10:12:23Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28307 and previous config saved to /var/cache/conftool/dbconfig/20220523-101222-ladsgroup.json

Change 797136 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1119: Disable notification

https://gerrit.wikimedia.org/r/797136

Change 797136 merged by Ladsgroup:

[operations/puppet@production] db1119: Disable notification

https://gerrit.wikimedia.org/r/797136

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1119.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1119.eqiad.wmnet with OS bullseye completed:

  • db1119 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231024_ladsgroup_1046863_db1119.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T11:00:44Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28312 and previous config saved to /var/cache/conftool/dbconfig/20220523-110043-ladsgroup.json

Change 797174 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1134: Disable notification

https://gerrit.wikimedia.org/r/797174

Change 797174 merged by Ladsgroup:

[operations/puppet@production] db1134: Disable notification

https://gerrit.wikimedia.org/r/797174

Mentioned in SAL (#wikimedia-operations) [2022-05-23T11:45:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28317 and previous config saved to /var/cache/conftool/dbconfig/20220523-114559-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T11:52:04Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28318 and previous config saved to /var/cache/conftool/dbconfig/20220523-115202-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1134.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1134.eqiad.wmnet with OS bullseye completed:

  • db1134 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231156_ladsgroup_1063149_db1134.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T12:39:44Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28321 and previous config saved to /var/cache/conftool/dbconfig/20220523-123944-ladsgroup.json

Change 797247 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1135: Disable notification

https://gerrit.wikimedia.org/r/797247

Change 797247 merged by Ladsgroup:

[operations/puppet@production] db1135: Disable notification

https://gerrit.wikimedia.org/r/797247

Mentioned in SAL (#wikimedia-operations) [2022-05-23T13:24:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28326 and previous config saved to /var/cache/conftool/dbconfig/20220523-132459-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T13:32:30Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28328 and previous config saved to /var/cache/conftool/dbconfig/20220523-133228-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1135.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1135.eqiad.wmnet with OS bullseye completed:

  • db1135 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231345_ladsgroup_1085481_db1135.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:22:02Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28339 and previous config saved to /var/cache/conftool/dbconfig/20220523-142202-ladsgroup.json

Change 797316 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1184: Disable notification

https://gerrit.wikimedia.org/r/797316

Change 797316 merged by Ladsgroup:

[operations/puppet@production] db1184: Disable notification

https://gerrit.wikimedia.org/r/797316

Mentioned in SAL (#wikimedia-operations) [2022-05-23T15:07:17Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28346 and previous config saved to /var/cache/conftool/dbconfig/20220523-150717-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T15:12:09Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28348 and previous config saved to /var/cache/conftool/dbconfig/20220523-151207-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1184.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1184.eqiad.wmnet with OS bullseye completed:

  • db1184 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231517_ladsgroup_1105614_db1184.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:01:06Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28353 and previous config saved to /var/cache/conftool/dbconfig/20220523-160105-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:46:21Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28357 and previous config saved to /var/cache/conftool/dbconfig/20220523-164621-ladsgroup.json

Change 797344 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1106: Disable notification

https://gerrit.wikimedia.org/r/797344

Change 797344 merged by Ladsgroup:

[operations/puppet@production] db1106: Disable notification

https://gerrit.wikimedia.org/r/797344

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:50:50Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28358 and previous config saved to /var/cache/conftool/dbconfig/20220523-165045-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1106.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1106.eqiad.wmnet with OS bullseye completed:

  • db1106 (WARN)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231659_ladsgroup_1128779_db1106.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T17:34:39Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28359 and previous config saved to /var/cache/conftool/dbconfig/20220523-173439-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T18:19:54Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28364 and previous config saved to /var/cache/conftool/dbconfig/20220523-181954-ladsgroup.json
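Each host reimaged on 2022-05-23 has two 'Repooling after maintenance' commits in the entries above. As a small worked check, the sketch below computes the time between the first and second recorded commit for each host. The timestamps are copied from the SAL entries; the `repool_spans` helper is hypothetical, and the log does not say whether other repool steps happened between the two commits.

```python
from datetime import datetime, timedelta
from typing import Dict, Tuple

# (first, second) 'Repooling after maintenance' commit per host, from the SAL.
REPOOL_COMMITS: Dict[str, Tuple[str, str]] = {
    "db1119": ("2022-05-23T11:00:44Z", "2022-05-23T11:45:59Z"),
    "db1134": ("2022-05-23T12:39:44Z", "2022-05-23T13:24:59Z"),
    "db1135": ("2022-05-23T14:22:02Z", "2022-05-23T15:07:17Z"),
    "db1184": ("2022-05-23T16:01:06Z", "2022-05-23T16:46:21Z"),
    "db1106": ("2022-05-23T17:34:39Z", "2022-05-23T18:19:54Z"),
}


def parse_ts(value: str) -> datetime:
    """Parse a SAL timestamp such as 2022-05-23T11:00:44Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def repool_spans(commits: Dict[str, Tuple[str, str]]) -> Dict[str, timedelta]:
    """Time between the first and second recorded repool commit per host."""
    return {
        host: parse_ts(second) - parse_ts(first)
        for host, (first, second) in commits.items()
    }


spans = repool_spans(REPOOL_COMMITS)
```

Every host shows the same span of 45 minutes 15 seconds, which suggests the repooling was driven by a script with fixed waits rather than by hand.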