Upgrade s7 to bullseye
Closed, ResolvedPublic

Description

  • dbstore1003:3317 (T299481)
  • db2150
  • db2122
  • db2121 (codfw primary)
  • db2120
  • db2118
  • db2108
  • db2098:3317 (T299876)
  • db2095:3317
  • db2087:3317
  • db2086:3317
  • db2077
  • db1181
  • db1174
  • db1171:3317 (T299876)
  • db1170:3317
  • db1158
  • db1155:3317
  • db1136 (eqiad primary)
  • db1127
  • db1101:3317
  • db1098:3317
  • clouddb1021:3317 (T299480)
  • clouddb1018:3317 (T299480)
  • clouddb1014:3317 (T299480)

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-02-23T08:13:41Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db2108 (T302363)', diff saved to https://phabricator.wikimedia.org/P21337 and previous config saved to /var/cache/conftool/dbconfig/20220223-081338-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db2108.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db2108.codfw.wmnet with OS bullseye completed:

  • db2108 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202230816_ladsgroup_5808_db2108.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
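
Each reimage run above records a log path for its first Puppet run. As a rough aid for locating those logs later, the sketch below splits such a path into its apparent parts. The layout `<YYYYmmddHHMM>_<user>_<number>_<host>.out` is an assumption inferred only from the paths quoted in this task, the meaning of the numeric field is not stated here, and `parse_reimage_log` is a hypothetical helper, not part of spicerack.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed layout, inferred from the paths in this task only:
#   <YYYYmmddHHMM>_<user>_<number>_<host>.out
LOG_NAME = re.compile(
    r"^(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<number>\d+)_(?P<host>[^_]+)\.out$"
)


def parse_reimage_log(path):
    """Split a reimage log path into its apparent components (hypothetical helper)."""
    match = LOG_NAME.match(PurePosixPath(path).name)
    if match is None:
        raise ValueError(f"unrecognised log name: {path}")
    return {
        "stamp": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "user": match["user"],
        "number": int(match["number"]),  # meaning not stated in the task
        "host": match["host"],
    }


info = parse_reimage_log(
    "/var/log/spicerack/sre/hosts/reimage/202202230816_ladsgroup_5808_db2108.out"
)
```

For the db2108 path this yields the host `db2108`, the user `ladsgroup`, and a timestamp of 2022-02-23 08:16, shortly after the depool recorded in SAL.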

Mentioned in SAL (#wikimedia-operations) [2022-02-23T08:57:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db2108 (T302363)', diff saved to https://phabricator.wikimedia.org/P21343 and previous config saved to /var/cache/conftool/dbconfig/20220223-085755-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-02-23T09:01:17Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db2077 (T302363)', diff saved to https://phabricator.wikimedia.org/P21345 and previous config saved to /var/cache/conftool/dbconfig/20220223-090109-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db2077.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db2077.codfw.wmnet with OS bullseye completed:

  • db2077 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202230903_ladsgroup_13739_db2077.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-23T09:46:56Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db2077 (T302363)', diff saved to https://phabricator.wikimedia.org/P21351 and previous config saved to /var/cache/conftool/dbconfig/20220223-094655-ladsgroup.json
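
The SAL entries above follow a regular shape, so the time each replica spent depooled can be estimated from them. The following is a minimal sketch, assuming the entry format seen in this task (`[timestamp] <user> dbctl commit (dc=all): 'Depooling|Repooling after maintenance <host> (<task>)'`). The names `parse_sal` and `depooled_minutes` are hypothetical, and only the first repool after a depool is counted, since later "Repooling after maintenance" entries for the same host appear to be further steps of the same repool.

```python
import re
from datetime import datetime

# Pattern assumed from the SAL lines quoted in this task.
SAL_ENTRY = re.compile(
    r"\[(?P<when>[^\]]+)\] <(?P<who>[^>]+)> dbctl commit \(dc=all\): "
    r"'(?P<action>Depooling|Repooling after maintenance) "
    r"(?P<host>\S+) \((?P<task>T\d+)\)'"
)


def parse_sal(line):
    """Return the fields of a dbctl SAL entry, or None if the line does not match."""
    match = SAL_ENTRY.search(line)
    if match is None:
        return None
    return {
        "when": datetime.strptime(match["when"], "%Y-%m-%dT%H:%M:%SZ"),
        "who": match["who"],
        "action": "depool" if match["action"] == "Depooling" else "repool",
        "host": match["host"],
        "task": match["task"],
    }


def depooled_minutes(lines):
    """Minutes between each host's depool and its first following repool."""
    depooled_at = {}
    result = {}
    for line in lines:
        entry = parse_sal(line)
        if entry is None:
            continue
        host = entry["host"]
        if entry["action"] == "depool":
            depooled_at[host] = entry["when"]
        elif host in depooled_at:
            # Count only the first repool; pop so later repools are ignored.
            start = depooled_at.pop(host)
            result[host] = (entry["when"] - start).total_seconds() / 60
    return result


sample = [
    "[2022-02-23T08:13:41Z] <ladsgroup@cumin1001> dbctl commit (dc=all): "
    "'Depooling db2108 (T302363)', diff saved to ...",
    "[2022-02-23T08:57:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): "
    "'Repooling after maintenance db2108 (T302363)', diff saved to ...",
]
durations = depooled_minutes(sample)
```

With the two db2108 entries, `durations["db2108"]` comes out at a little over 44 minutes.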

Change 765236 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1181: Disable notifications

https://gerrit.wikimedia.org/r/765236

Change 765236 merged by Ladsgroup:

[operations/puppet@production] db1181: Disable notifications

https://gerrit.wikimedia.org/r/765236

Mentioned in SAL (#wikimedia-operations) [2022-02-23T11:05:43Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21359 and previous config saved to /var/cache/conftool/dbconfig/20220223-110540-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1181.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1181.eqiad.wmnet with OS bullseye completed:

  • db1181 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202231117_ladsgroup_29172_db1181.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-23T11:52:33Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21365 and previous config saved to /var/cache/conftool/dbconfig/20220223-115233-ladsgroup.json

Change 765255 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1174: Disable notifications

https://gerrit.wikimedia.org/r/765255

Change 765255 merged by Ladsgroup:

[operations/puppet@production] db1174: Disable notifications

https://gerrit.wikimedia.org/r/765255

Mentioned in SAL (#wikimedia-operations) [2022-02-23T12:37:47Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1181 (T302363)', diff saved to https://phabricator.wikimedia.org/P21374 and previous config saved to /var/cache/conftool/dbconfig/20220223-123747-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-02-23T12:40:32Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21375 and previous config saved to /var/cache/conftool/dbconfig/20220223-124027-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1174.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1174.eqiad.wmnet with OS bullseye completed:

  • db1174 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202231244_ladsgroup_17864_db1174.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-23T13:38:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21385 and previous config saved to /var/cache/conftool/dbconfig/20220223-133858-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-02-23T14:24:13Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1174 (T302363)', diff saved to https://phabricator.wikimedia.org/P21392 and previous config saved to /var/cache/conftool/dbconfig/20220223-142413-ladsgroup.json

Change 765308 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1127: Disable notifications

https://gerrit.wikimedia.org/r/765308

Change 765308 merged by Ladsgroup:

[operations/puppet@production] db1127: Disable notifications

https://gerrit.wikimedia.org/r/765308

Mentioned in SAL (#wikimedia-operations) [2022-02-23T16:44:56Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21403 and previous config saved to /var/cache/conftool/dbconfig/20220223-164453-ladsgroup.json

Dupe of T301653?

Oh, yes, you just said so. Never mind!

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1127.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1127.eqiad.wmnet with OS bullseye completed:

  • db1127 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202231648_ladsgroup_21330_db1127.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-23T17:22:07Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21404 and previous config saved to /var/cache/conftool/dbconfig/20220223-172206-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-02-23T18:07:22Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1127 (T302363)', diff saved to https://phabricator.wikimedia.org/P21408 and previous config saved to /var/cache/conftool/dbconfig/20220223-180722-ladsgroup.json

Change 765316 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1158: Disable notifications

https://gerrit.wikimedia.org/r/765316

Change 765316 merged by Ladsgroup:

[operations/puppet@production] db1158: Disable notifications

https://gerrit.wikimedia.org/r/765316

Mentioned in SAL (#wikimedia-operations) [2022-02-23T18:13:56Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21409 and previous config saved to /var/cache/conftool/dbconfig/20220223-181350-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1158.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1158.eqiad.wmnet with OS bullseye completed:

  • db1158 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202231818_ladsgroup_5104_db1158.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-02-23T18:57:41Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21410 and previous config saved to /var/cache/conftool/dbconfig/20220223-185740-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-02-23T19:42:55Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1158 (T302363)', diff saved to https://phabricator.wikimedia.org/P21414 and previous config saved to /var/cache/conftool/dbconfig/20220223-194254-ladsgroup.json

Change 765489 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db2079: Disable notifications

https://gerrit.wikimedia.org/r/765489

Change 765539 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db2121: Disable notifications

https://gerrit.wikimedia.org/r/765539

Change 765539 merged by Ladsgroup:

[operations/puppet@production] db2121: Disable notifications

https://gerrit.wikimedia.org/r/765539

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db2121.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db2121.codfw.wmnet with OS bullseye completed:

  • db2121 (WARN)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202202241326_ladsgroup_19047_db2121.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

This is blocked on the s7 primary switchover. db1101 can't be done now either, because its s8 instance can't be depooled: the dumper connects to every db replica of s8 (T138208). I will do that as part of the s8 upgrade anyway (T302185).

Change 784078 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1136: Disable notifications

https://gerrit.wikimedia.org/r/784078

Change 784078 merged by Marostegui:

[operations/puppet@production] db1136: Disable notifications

https://gerrit.wikimedia.org/r/784078

@Ladsgroup I will take care of db1136's reimage and close this task once it is done. I need to do a lot of other maintenance on this host before the reimage.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1136.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1136.eqiad.wmnet with OS bullseye completed:

  • db1136 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204191146_marostegui_3355307_db1136.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2896452/

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1136.eqiad.wmnet with OS bullseye executed with errors:

  • db1136 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202204191146_marostegui_3355307_db1136.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB
    • Failed to get Netbox script results, try manually: https://netbox.wikimedia.org/api/extras/job-results/2896452/
    • The reimage failed, see the cookbook logs for the details

Marostegui updated the task description.

db1136, the old s7 master, is done.