⚓ T303171 Upgrade s1 to Bullseye

Subject	Repo	Branch	Lines +/-
db1106: Disable notification	operations/puppet	production	+2 -0
db1184: Disable notification	operations/puppet	production	+1 -0
db1135: Disable notification	operations/puppet	production	+1 -0
db1134: Disable notification	operations/puppet	production	+1 -0
db1119: Disable notification	operations/puppet	production	+1 -0
db1118: Disable notifications	operations/puppet	production	+1 -0
db1164: Disable notifications	operations/puppet	production	+1 -0

Status	Assigned	Task
Open	None	T291916 Tracking task for Bullseye migrations in production
Resolved	Marostegui	T298585 Upgrade WMF database-and-backup-related hosts to bullseye
Resolved	Ladsgroup	T303171 Upgrade s1 to Bullseye
Resolved	• Cmjohnson	T307198 db1164 fails to POST/boot/etc
Resolved	Ladsgroup	T301312 Switchover s1 master (db1118 -> db1163)
Resolved	Jclark-ctr	T308246 db1164 power supply isn't redundant

Mentioned in SAL (#wikimedia-operations) [2022-04-29T08:58:58Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27006 and previous config saved to /var/cache/conftool/dbconfig/20220429-085858-kormat.json

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:14:02Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 100%: Reboot T303171', diff saved to https://phabricator.wikimedia.org/P27007 and previous config saved to /var/cache/conftool/dbconfig/20220429-091401-kormat.json

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:36:07Z] <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 1:30:00 on db1164.eqiad.wmnet with reason: Rebooting for T303171

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:36:12Z] <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1164.eqiad.wmnet with reason: Rebooting for T303171

Mentioned in SAL (#wikimedia-operations) [2022-04-29T09:36:18Z] <kormat@cumin1001> dbctl commit (dc=all): 'db1164 depooling: Rebooting for T303171', diff saved to https://phabricator.wikimedia.org/P27008 and previous config saved to /var/cache/conftool/dbconfig/20220429-093613-kormat.json

Kormat updated the task description. Apr 29 2022, 9:36 AM

Cookbook cookbooks.sre.hosts.reimage was started by kormat@cumin1001 for host db1164.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by kormat@cumin1001 for host db1164.eqiad.wmnet with OS bullseye executed with errors:

db1164 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details
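
The reimage cookbook reports each run as a host line with a status in parentheses, followed by a dash-prefixed list of steps. As a minimal sketch, assuming that layout holds for every summary in this task, the block can be split into host, status, and steps like this. The function name `parse_reimage_summary` is hypothetical and is not part of the cookbook or spicerack.

```python
import re


def parse_reimage_summary(text):
    """Split a reimage summary comment into (host, status, steps).

    Assumes the layout seen in this task: a 'host (STATUS)' header line
    followed by '- step' lines. This is an illustration, not cookbook code.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = re.fullmatch(r"(\S+) \((\w+)\)", lines[0])
    if header is None:
        raise ValueError(f"unexpected header line: {lines[0]!r}")
    steps = [ln[2:] for ln in lines[1:] if ln.startswith("- ")]
    return header.group(1), header.group(2), steps


# Shortened copy of the failed db1164 summary above.
sample = """
db1164 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- The reimage failed, see the cookbook logs for the details
"""

host, status, steps = parse_reimage_summary(sample)
print(host, status, len(steps))
```

The last entry in `steps` shows how far a run got, which makes it easy to compare a FAIL run with a later WARN run for the same host.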

Kormat mentioned this in T307198: db1164 fails to POST/boot/etc. Apr 29 2022, 10:27 AM

Stalling until db1164 is back in service: T307198: db1164 fails to POST/boot/etc

Kormat changed the task status from Open to Stalled. May 2 2022, 7:24 AM

Cmjohnson closed subtask T307198: db1164 fails to POST/boot/etc as Resolved. May 11 2022, 7:19 PM

T307198 has been fixed. I will reimage db1164 now so that it doesn't fall even further behind.

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1164.eqiad.wmnet with OS bullseye

Ladsgroup added a subtask: T301312: Switchover s1 master (db1118 -> db1163). May 12 2022, 1:38 PM

Change 791377 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1164: Disable notifications

https://gerrit.wikimedia.org/r/791377

Change 791377 merged by Marostegui:

[operations/puppet@production] db1164: Disable notifications

https://gerrit.wikimedia.org/r/791377

I have reimaged db1164 and started mysql on it again; it is 13 days behind the master :-/
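
The 13-day figure matches the timestamps recorded in this task. The sketch below redoes the arithmetic under two assumptions: replication on db1164 stopped at about the time of the depool commit (2022-04-29T09:36:18Z), and the `202205121334` prefix of the second reimage log filename, listed further down in this task, is a UTC start time.

```python
from datetime import datetime, timezone

# Timestamp of the 'db1164 depooling' dbctl commit logged in SAL (UTC).
depooled_at = datetime(2022, 4, 29, 9, 36, 18, tzinfo=timezone.utc)

# Assumption: the reimage log filename prefix 202205121334 is a UTC start time.
second_reimage_at = datetime(2022, 5, 12, 13, 34, tzinfo=timezone.utc)

# Assumption: replication stopped around the depool, so this gap
# approximates how far the host was behind when it came back.
lag = second_reimage_at - depooled_at
print(f"{lag.days} days, {lag.seconds // 3600} hours behind")
```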

Marostegui updated the task description. May 12 2022, 2:00 PM

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1164.eqiad.wmnet with OS bullseye completed:

db1164 (WARN)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205121334_marostegui_1255335_db1164.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Maintenance_bot removed a project: Patch-For-Review. May 12 2022, 2:31 PM

db1164 has finally caught up. I will start repooling it next week.

Ladsgroup closed subtask T301312: Switchover s1 master (db1118 -> db1163) as Resolved. May 19 2022, 6:19 AM

Marostegui updated the task description. May 20 2022, 6:37 AM

Jclark-ctr closed subtask T308246: db1164 power supply isn't redundant as Resolved. May 20 2022, 1:35 PM

Change 796507 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/796507

Change 796507 merged by Marostegui:

[operations/puppet@production] db1118: Disable notifications

https://gerrit.wikimedia.org/r/796507

Cookbook cookbooks.sre.hosts.reimage was started by marostegui@cumin1001 for host db1118.eqiad.wmnet with OS bullseye

Marostegui updated the task description. May 23 2022, 5:22 AM

Cookbook cookbooks.sre.hosts.reimage started by marostegui@cumin1001 for host db1118.eqiad.wmnet with OS bullseye completed:

db1118 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205230506_marostegui_978453_db1118.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Kormat is out sick. I will quickly finish this.

Mentioned in SAL (#wikimedia-operations) [2022-05-23T10:12:23Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28307 and previous config saved to /var/cache/conftool/dbconfig/20220523-101222-ladsgroup.json

Change 797136 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1119: Disable notification

https://gerrit.wikimedia.org/r/797136

Change 797136 merged by Ladsgroup:

[operations/puppet@production] db1119: Disable notification

https://gerrit.wikimedia.org/r/797136

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1119.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1119.eqiad.wmnet with OS bullseye completed:

db1119 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231024_ladsgroup_1046863_db1119.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T11:00:44Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28312 and previous config saved to /var/cache/conftool/dbconfig/20220523-110043-ladsgroup.json
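
The `dbctl commit` entries in SAL follow a regular shape, so the time a host spent out of rotation can be read straight from the log. The sketch below is a minimal illustration: the regular expression is inferred from the entries quoted in this task and is an assumption, not an official SAL or dbctl format, and `parse_dbctl_entry` is a hypothetical helper name.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the SAL entries in this task (an assumption,
# not a documented format).
SAL_DBCTL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] <(?P<user>[^@>]+)@(?P<origin>[^>]+)> "
    r"dbctl commit \(dc=(?P<dc>\w+)\): '(?P<message>[^']*)'"
)


def parse_dbctl_entry(line):
    """Return the fields of one 'dbctl commit' SAL entry, or None if no match."""
    match = SAL_DBCTL_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # SAL timestamps end in 'Z', so attach UTC explicitly.
    fields["ts"] = datetime.strptime(
        fields["ts"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return fields


# The db1119 depool and first repool entries, copied from this task.
depool = parse_dbctl_entry(
    "Mentioned in SAL (#wikimedia-operations) [2022-05-23T10:12:23Z] "
    "<ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1119 (T303171)', "
    "diff saved to https://phabricator.wikimedia.org/P28307"
)
repool = parse_dbctl_entry(
    "Mentioned in SAL (#wikimedia-operations) [2022-05-23T11:00:44Z] "
    "<ladsgroup@cumin1001> dbctl commit (dc=all): "
    "'Repooling after maintenance db1119 (T303171)', "
    "diff saved to https://phabricator.wikimedia.org/P28312"
)

out_of_rotation = repool["ts"] - depool["ts"]
print(depool["message"], "->", repool["message"], "took", out_of_rotation)
```

This measures the gap between the depool commit and the first repool commit only. Later repool commits for the same host carry the same message, so they would need to be told apart by timestamp.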

Ladsgroup updated the task description. May 23 2022, 11:04 AM

Change 797174 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1134: Disable notification

https://gerrit.wikimedia.org/r/797174

Change 797174 merged by Ladsgroup:

[operations/puppet@production] db1134: Disable notification

https://gerrit.wikimedia.org/r/797174

Mentioned in SAL (#wikimedia-operations) [2022-05-23T11:45:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1119 (T303171)', diff saved to https://phabricator.wikimedia.org/P28317 and previous config saved to /var/cache/conftool/dbconfig/20220523-114559-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T11:52:04Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28318 and previous config saved to /var/cache/conftool/dbconfig/20220523-115202-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1134.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1134.eqiad.wmnet with OS bullseye completed:

db1134 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231156_ladsgroup_1063149_db1134.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Ladsgroup updated the task description. May 23 2022, 12:31 PM

Mentioned in SAL (#wikimedia-operations) [2022-05-23T12:39:44Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28321 and previous config saved to /var/cache/conftool/dbconfig/20220523-123944-ladsgroup.json

Change 797247 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1135: Disable notification

https://gerrit.wikimedia.org/r/797247

Change 797247 merged by Ladsgroup:

[operations/puppet@production] db1135: Disable notification

https://gerrit.wikimedia.org/r/797247

Mentioned in SAL (#wikimedia-operations) [2022-05-23T13:24:59Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1134 (T303171)', diff saved to https://phabricator.wikimedia.org/P28326 and previous config saved to /var/cache/conftool/dbconfig/20220523-132459-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T13:32:30Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28328 and previous config saved to /var/cache/conftool/dbconfig/20220523-133228-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1135.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1135.eqiad.wmnet with OS bullseye completed:

db1135 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231345_ladsgroup_1085481_db1135.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T14:22:02Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28339 and previous config saved to /var/cache/conftool/dbconfig/20220523-142202-ladsgroup.json

Ladsgroup updated the task description. May 23 2022, 2:26 PM

Change 797316 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1184: Disable notification

https://gerrit.wikimedia.org/r/797316

Change 797316 merged by Ladsgroup:

[operations/puppet@production] db1184: Disable notification

https://gerrit.wikimedia.org/r/797316

Mentioned in SAL (#wikimedia-operations) [2022-05-23T15:07:17Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1135 (T303171)', diff saved to https://phabricator.wikimedia.org/P28346 and previous config saved to /var/cache/conftool/dbconfig/20220523-150717-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T15:12:09Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28348 and previous config saved to /var/cache/conftool/dbconfig/20220523-151207-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1184.eqiad.wmnet with OS bullseye

Ladsgroup updated the task description. May 23 2022, 3:41 PM

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1184.eqiad.wmnet with OS bullseye completed:

db1184 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231517_ladsgroup_1105614_db1184.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:01:06Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28353 and previous config saved to /var/cache/conftool/dbconfig/20220523-160105-ladsgroup.json

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:46:21Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1184 (T303171)', diff saved to https://phabricator.wikimedia.org/P28357 and previous config saved to /var/cache/conftool/dbconfig/20220523-164621-ladsgroup.json

Change 797344 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] db1106: Disable notification

https://gerrit.wikimedia.org/r/797344

Change 797344 merged by Ladsgroup:

[operations/puppet@production] db1106: Disable notification

https://gerrit.wikimedia.org/r/797344

Mentioned in SAL (#wikimedia-operations) [2022-05-23T16:50:50Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28358 and previous config saved to /var/cache/conftool/dbconfig/20220523-165045-ladsgroup.json

Cookbook cookbooks.sre.hosts.reimage was started by ladsgroup@cumin1001 for host db1106.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ladsgroup@cumin1001 for host db1106.eqiad.wmnet with OS bullseye completed:

db1106 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202205231659_ladsgroup_1128779_db1106.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-05-23T17:34:39Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28359 and previous config saved to /var/cache/conftool/dbconfig/20220523-173439-ladsgroup.json

Ladsgroup removed a project: Patch-For-Review. May 23 2022, 6:12 PM

Ladsgroup updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-05-23T18:19:54Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1106 (T303171)', diff saved to https://phabricator.wikimedia.org/P28364 and previous config saved to /var/cache/conftool/dbconfig/20220523-181954-ladsgroup.json

Ladsgroup closed this task as Resolved. May 23 2022, 6:25 PM

Maintenance_bot moved this task from In progress to Done on the DBA board. May 23 2022, 6:29 PM

	Marostegui
	Mar 7 2022, 12:06 PM
