Service implementation for elastic20[61-86].codfw.wmnet
Closed, Resolved · Public · 5 Estimated Story Points

Assigned To

Authored By

	RKemper
	Feb 4 2022, 2:08 AM

Description

(elastic20[61-72]) See T291654 for procurement, T294154 for racking. These 12 refresh hosts are replacing elastic20[25-36].
(elastic20[73-86]) See T299608 for procurement, T299608 for racking. These are 14 new expansion hosts.

Step 1: Set up hieradata

Allocate the new hosts between the psi and omega clusters, keeping each row as balanced as possible.
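The balancing goal in Step 1 could be sketched as a simple round-robin. This is only an illustration: the row layout below is hypothetical (the task does not list which host sits in which row), and `allocate` is a made-up helper, not part of the puppet repository.

```python
from itertools import cycle


def allocate(hosts_by_row, clusters=("psi", "omega")):
    """Alternate clusters across hosts, walking the rows in order.

    Because the assignment strictly alternates, any contiguous run of
    hosts (and therefore every row) gets cluster counts that differ by
    at most one.
    """
    assignment = {}
    next_cluster = cycle(clusters)
    for row in sorted(hosts_by_row):
        for host in sorted(hosts_by_row[row]):
            assignment[host] = next(next_cluster)
    return assignment


# Hypothetical row layout, for illustration only.
hosts_by_row = {
    "A": ["elastic2073", "elastic2074", "elastic2075", "elastic2076"],
    "B": ["elastic2077", "elastic2078", "elastic2079"],
    "C": ["elastic2080", "elastic2081", "elastic2082", "elastic2083"],
    "D": ["elastic2084", "elastic2085", "elastic2086"],
}

assignment = allocate(hosts_by_row)
for host, cluster in sorted(assignment.items()):
    print(host, cluster)
```

The real allocation lives in hieradata and may also weigh rack placement or existing cluster membership, which this sketch ignores.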

Step 2: Enable cirrus roles

After this step completes, the new hosts should have joined the cirrus Elasticsearch clusters.

Step 3: Prepare to decom old hosts

Set the new master configuration: https://phabricator.wikimedia.org/T294805#7473840
Manually ban the old hosts from the cluster
Set new replication master seeds (if the masters changed): https://phabricator.wikimedia.org/T294805#7701855
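The task does not say how the manual ban is performed. One common way to drain nodes in Elasticsearch is shard allocation filtering through the cluster settings API, so the sketch below builds such a request body. Treat it as an assumption: the node-name pattern (`elastic2025*` and so on) and the choice of a transient setting are illustrative, and nothing is sent to a cluster.

```python
import json

# The twelve hosts slated for decommissioning, per the task description.
old_hosts = [f"elastic20{n}" for n in range(25, 37)]

# Body for PUT /_cluster/settings. Shards are moved away from any node
# whose name matches one of the comma-separated wildcard patterns.
body = {
    "transient": {
        "cluster.routing.allocation.exclude._name": ",".join(
            f"{host}*" for host in old_hosts
        )
    }
}

print(json.dumps(body, indent=2))
```

Once the excluded nodes hold no shards, they can be stopped without affecting cluster health.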

Step 4: Actually decom hosts (elastic20[25-36])

Remove the cirrus role and run the decom cookbooks, then open decom tickets for DC Ops.

Details

Subject	Repo	Branch	Lines +/-
elastic: decom elastic20[25-36]	operations/puppet	production	+5 -82
elastic: bring new hosts into elastic cluster	operations/puppet	production	+1 -1
elastic: prepare to add new codfw hosts	operations/puppet	production	+63 -21
elastic: add conftool entries for new hosts	operations/puppet	production	+12 -0
elastic: enable elastic20[64-72] cirrus roles	operations/puppet	production	+4 -4
elastic: prep to bring elastic20[64-72] in	operations/puppet	production	+33 -5
elastic: add rack info for 3 new hosts	operations/puppet	production	+6 -1
elastic: bring 3 hosts in for extra capacity	operations/puppet	production	+12 -1


Related Objects

Mentioned In: T317816: Enable 10G networking in cirrus elastic clusters
T309810: Service implementation for elastic1[084-102].eqiad.wmnet
T300944: elastic2035 disk space critical (3% remaining)
Mentioned Here: T321243: decommission elastic20[25-36].codfw.wmnet
T299608: Q3:(Need By: TBD) rack/setup/install elastic20[73-86]
T294154: Q2:(Need By: TBD) rack/setup/install elastic20[61-72]
T294805: Service implementation for elastic10[68-83].eqiad.wmnet

Event Timeline


RKemper mentioned this in T300944: elastic2035 disk space critical (3% remaining). Feb 4 2022, 2:09 AM

MPhamWMF moved this task from needs triage to Current work on the Discovery-Search board. Feb 7 2022, 4:35 PM

MPhamWMF edited projects, added Discovery-Search (Current work); removed Discovery-Search.

MPhamWMF set the point value for this task to 5. Feb 14 2022, 4:42 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

RKemper updated the task description. Jun 2 2022, 7:42 PM

RKemper renamed this task from Service implementation for elastic20[61-72].codfw.wmnet to Service implementation for elastic20[61-86].codfw.wmnet. Jun 2 2022, 8:14 PM

RKemper claimed this task.

RKemper updated the task description.

RKemper mentioned this in T309810: Service implementation for elastic1[084-102].eqiad.wmnet.

Mentioned in SAL (#wikimedia-operations) [2022-07-15T18:30:32Z] <ryankemper> T300943 Re-imaging elastic20[61-72] from buster -> bullseye, one host at a time. These hosts are not in service currently so re-imaging is safe.
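The "one host at a time" pass over `elastic20[61-72]` can be sketched as expanding the bracketed range and building one reimage invocation per host. The `expand` helper is hypothetical, and the command-line flags (`--os`, `-t`) are an assumption about the cookbook's interface rather than something shown in this task. The commands are only printed, never executed.

```python
import re


def expand(spec):
    """Expand one bracketed numeric range, e.g. 'elastic20[61-72]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding such as [084-102]
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


hosts = expand("elastic20[61-72]")

# One invocation per host, to be run sequentially.
commands = [
    ["sudo", "cookbook", "sre.hosts.reimage", "--os", "bullseye",
     "-t", "T300943", host]
    for host in hosts
]

for command in commands:
    print(" ".join(command))
```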

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2061.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2061.codfw.wmnet with OS bullseye completed:

elastic2061 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207151830_ryankemper_3624114_elastic2061.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2062.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2062.codfw.wmnet with OS bullseye completed:

elastic2062 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207151901_ryankemper_3628952_elastic2062.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2063.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2063.codfw.wmnet with OS bullseye completed:

elastic2063 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207152038_ryankemper_3643602_elastic2063.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2064.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2064.codfw.wmnet with OS bullseye completed:

elastic2064 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207160013_ryankemper_3680318_elastic2064.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2065.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2065.codfw.wmnet with OS bullseye completed:

elastic2065 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207181743_ryankemper_1470187_elastic2065.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye executed with errors:

elastic2066 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye executed with errors:

elastic2066 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye executed with errors:

elastic2066 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2069.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2069.codfw.wmnet with OS bullseye completed:

elastic2069 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207191850_ryankemper_1802849_elastic2069.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye executed with errors:

elastic2066 (FAIL)
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Change 815778 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: bring 3 hosts in for extra capacity

https://gerrit.wikimedia.org/r/815778

gerritbot added a project: Patch-For-Review. Jul 20 2022, 6:22 PM

Change 815778 merged by Ryan Kemper:

[operations/puppet@production] elastic: bring 3 hosts in for extra capacity

https://gerrit.wikimedia.org/r/815778

Maintenance_bot removed a project: Patch-For-Review. Jul 20 2022, 6:30 PM

Change 815785 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: add rack info for 3 new hosts

https://gerrit.wikimedia.org/r/815785

gerritbot added a project: Patch-For-Review. Jul 20 2022, 6:37 PM

Change 815785 merged by Ryan Kemper:

[operations/puppet@production] elastic: add rack info for 3 new hosts

https://gerrit.wikimedia.org/r/815785

Maintenance_bot removed a project: Patch-For-Review. Jul 20 2022, 7:30 PM

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2067.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2068.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2070.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2071.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic2072.codfw.wmnet with OS bullseye

Mentioned in SAL (#wikimedia-operations) [2022-07-20T23:11:55Z] <ryankemper> T300943 Fixed IPMI passwords for elastic 20[67,68,70,71,72], reimaging them to bullseye (these hosts are not in service, thus the batch operation)

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2068.codfw.wmnet with OS bullseye completed:

elastic2068 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207202307_ryankemper_2383652_elastic2068.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2071.codfw.wmnet with OS bullseye completed:

elastic2071 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207202310_ryankemper_2384010_elastic2071.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2070.codfw.wmnet with OS bullseye completed:

elastic2070 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207202310_ryankemper_2384004_elastic2070.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2067.codfw.wmnet with OS bullseye completed:

elastic2067 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207202307_ryankemper_2383648_elastic2067.out
- Checked BIOS boot parameters are back to normal
- Unable to run puppet on puppetmaster2001.codfw.wmnet,puppetmaster1001.eqiad.wmnet to update configmaster.wikimedia.org with the new host SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic2072.codfw.wmnet with OS bullseye completed:

elastic2072 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207202310_ryankemper_2384019_elastic2072.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Change 815823 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: prep to bring elastic20[64-72] in

https://gerrit.wikimedia.org/r/815823

gerritbot added a project: Patch-For-Review. Jul 21 2022, 1:00 AM

Change 815823 merged by Ryan Kemper:

[operations/puppet@production] elastic: prep to bring elastic20[64-72] in

https://gerrit.wikimedia.org/r/815823

Mentioned in SAL (#wikimedia-operations) [2022-07-21T15:50:41Z] <ryankemper> T300943 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/815823 and running puppet across elastic2* in preparation for adding new codfw hosts into service

Change 816008 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: enable elastic20[64-72] cirrus roles

https://gerrit.wikimedia.org/r/816008

Mentioned in SAL (#wikimedia-operations) [2022-07-21T16:09:52Z] <ryankemper> T300943 Merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/816008 and running puppet twice on elastic20[64-72]

Change 816008 merged by Ryan Kemper:

[operations/puppet@production] elastic: enable elastic20[64-72] cirrus roles

https://gerrit.wikimedia.org/r/816008

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye

Change 816017 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: add conftool entries for new hosts

https://gerrit.wikimedia.org/r/816017

Change 816017 merged by Ryan Kemper:

[operations/puppet@production] elastic: add conftool entries for new hosts

https://gerrit.wikimedia.org/r/816017

Mentioned in SAL (#wikimedia-operations) [2022-07-21T16:58:58Z] <ryankemper> T300943 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/816017 to get conftool-data entries for new elastic2* hosts

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye executed with errors:

elastic2066 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2066.codfw.wmnet with OS bullseye completed:

elastic2066 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202207211849_bking_2682719_elastic2066.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic1058.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic1058.eqiad.wmnet with OS bullseye completed:

elastic1058 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208091450_bking_3133870_elastic1058.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic1069.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic1069.eqiad.wmnet with OS bullseye completed:

elastic1069 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208091530_bking_3143357_elastic1069.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic1072.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic1072.eqiad.wmnet with OS bullseye completed:

elastic1072 (WARN)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208091700_bking_3159778_elastic1072.out
- Checked BIOS boot parameters are back to normal
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is not optimal, downtime not removed
- Updated Netbox data from PuppetDB

Gehel moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Aug 15 2022, 3:12 PM

RKemper added a comment. Aug 23 2022, 7:25 PM

This comment was removed by RKemper.

Change 829052 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: prepare to add new hosts

https://gerrit.wikimedia.org/r/829052

Change 829052 merged by Bking:

[operations/puppet@production] elastic: prepare to add new codfw hosts

https://gerrit.wikimedia.org/r/829052

Change 829056 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: bring new hosts into elastic cluster

https://gerrit.wikimedia.org/r/829056

Change 829056 merged by Bking:

[operations/puppet@production] elastic: bring new hosts into elastic cluster

https://gerrit.wikimedia.org/r/829056

Mentioned in SAL (#wikimedia-operations) [2022-09-01T20:35:13Z] <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 14 hosts with reason: T300943

Mentioned in SAL (#wikimedia-operations) [2022-09-01T20:35:34Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: T300943

Mentioned in SAL (#wikimedia-operations) [2022-09-01T20:40:15Z] <ryankemper> T300943 New hosts are in service and were pooled like so: sudo confctl select name=elastic20[73-86].* set/weight=10:pooled=yes (in retrospect that syntax seems to have selected too many hosts, but the final state of pybal is correct per https://config-master.wikimedia.org/pybal/codfw/search)
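Why that selector over-matched can be illustrated with Python's `re` module, assuming confctl treats the `name=` value as a regular expression. In a regex, `[73-86]` is a character class (the characters `7`, `3` through `8`, and `6`), not a numeric range, so it matches any single digit from 3 to 8. The host list and the stricter pattern below are illustrative only.

```python
import re

# Selector as used in the SAL entry, interpreted as a regular expression.
loose = re.compile(r"elastic20[73-86].*")

# Every codfw host number mentioned in this task: 2025 through 2086.
hosts = [f"elastic20{n}.codfw.wmnet" for n in range(25, 87)]

# The 14 expansion hosts that were meant to be pooled.
intended = [f"elastic20{n}.codfw.wmnet" for n in range(73, 87)]

loose_matched = [h for h in hosts if loose.fullmatch(h)]

# A stricter pattern that spells the numeric range out digit by digit.
strict = re.compile(r"elastic20(7[3-9]|8[0-6])\.codfw\.wmnet")
strict_matched = [h for h in hosts if strict.fullmatch(h)]

print(len(loose_matched), "hosts matched by the loose pattern")
print(len(strict_matched), "hosts matched by the strict pattern")
```

Under this interpretation the loose pattern also catches hosts such as elastic2035 and elastic2061, which is consistent with the note that too many hosts were selected.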

RKemper mentioned this in T317816: Enable 10G networking in cirrus elastic clusters. Sep 14 2022, 9:39 PM

Gehel updated the task description. Sep 29 2022, 6:40 PM

RKemper updated the task description. Oct 14 2022, 12:25 AM

Mentioned in SAL (#wikimedia-operations) [2022-10-14T00:36:36Z] <ryankemper> T300943 Decom'ing elastic20[25-36]. Decommissioning in batches by row, starting with row A (2025-27)
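The batching by row can be sketched as splitting the twelve old hosts into consecutive groups of three. Only the row A grouping (2025-27) is stated in the log. The sketch assumes the remaining rows likewise hold three consecutive hosts each, and `batches` is a hypothetical helper.

```python
def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


old_hosts = [f"elastic20{n}" for n in range(25, 37)]

# Assumption: three consecutive hosts per row, as with row A (2025-27).
row_batches = batches(old_hosts, 3)

for batch in row_batches:
    print(f"decommission batch: elastic[{batch[0][7:]}-{batch[-1][7:]}]")
```

Decommissioning one row at a time limits how much capacity leaves any single row at once.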

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2025-2027]

elastic2025 (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2026 (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2027 (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2028-2030]

elastic2028 (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2029 (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2030 (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Change 842547 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: decom elastic20[25-36]

https://gerrit.wikimedia.org/r/842547

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2031-2033].codfw.wmnet

elastic2031.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2032.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2033.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-10-14T02:01:12Z] <ryankemper> T300943 Final batch of decom'ing elastic20[25-36] => already decommissioned rows A/B/C; starting final row D (corresponding to 203[4,6])

Change 842547 merged by Ryan Kemper:

[operations/puppet@production] elastic: decom elastic20[25-36]

https://gerrit.wikimedia.org/r/842547

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2034,2036].codfw.wmnet

elastic2034.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2036.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

Mentioned in SAL (#wikimedia-operations) [2022-10-14T02:11:26Z] <ryankemper> T300943 Decom of elastic20[25-36] complete. Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/842547. This is done

Final step is to open dcops ticket, and then this can be moved to needs reporting.

(Need to step out so will open ticket when I get back)

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2025-2027].codfw.wmnet

elastic2025.codfw.wmnet (FAIL)
- No DNS record found for the mgmt interface elastic2025.mgmt.codfw.wmnet, trying the asset tag one: wmf6490.mgmt.codfw.wmnet
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2026.codfw.wmnet (FAIL)
- No DNS record found for the mgmt interface elastic2026.mgmt.codfw.wmnet, trying the asset tag one: wmf6491.mgmt.codfw.wmnet
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2027.codfw.wmnet (FAIL)
- No DNS record found for the mgmt interface elastic2027.mgmt.codfw.wmnet, trying the asset tag one: wmf6492.mgmt.codfw.wmnet
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by ryankemper@cumin2002 for hosts: elastic[2028-2030].codfw.wmnet

elastic2028.codfw.wmnet (FAIL)
- No DNS record found for the mgmt interface elastic2028.mgmt.codfw.wmnet, trying the asset tag one: wmf6493.mgmt.codfw.wmnet
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2029.codfw.wmnet (FAIL)
- No DNS record found for the mgmt interface elastic2029.mgmt.codfw.wmnet, trying the asset tag one: wmf6494.mgmt.codfw.wmnet
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

elastic2030.codfw.wmnet (FAIL)
- No DNS record found for the mgmt interface elastic2030.mgmt.codfw.wmnet, trying the asset tag one: wmf6495.mgmt.codfw.wmnet
- Downtimed host on Icinga/Alertmanager
- Found physical host
- Downtimed management interface on Icinga/Alertmanager
- Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2)
- Powered off
- [Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, updated switch interfaces (disabled, removed vlans, etc)
- Configured the linked switch interface(s)
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB

ERROR: some step on some host failed, check the bolded items above

Decom is done: T321243

Note that the decom cookbook failures above are not actual failures; those hosts (elastic20[25-30]) had already been decommissioned successfully in the earlier PASS runs.
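To see why the FAIL results are benign, the step lists from the two cookbook runs against elastic2025 can be compared. This is a minimal sketch: the step strings are copied from the cookbook output above, while the variable and function names are illustrative only.

```python
# Steps reported for elastic2025 in the first (PASS) run.
first_run = [
    "Downtimed host on Icinga/Alertmanager",
    "Found physical host",
    "Downtimed management interface on Icinga/Alertmanager",
    "Wiped all swraid, partition-table and filesystem signatures",
    "Powered off",
    "[Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, "
    "updated switch interfaces (disabled, removed vlans, etc)",
    "Configured the linked switch interface(s)",
    "Removed from DebMonitor",
    "Removed from Puppet master and PuppetDB",
]

# Steps reported for elastic2025 in the later (FAIL) run.
rerun = [
    "No DNS record found for the mgmt interface elastic2025.mgmt.codfw.wmnet, "
    "trying the asset tag one: wmf6490.mgmt.codfw.wmnet",
    "Downtimed host on Icinga/Alertmanager",
    "Found physical host",
    "Downtimed management interface on Icinga/Alertmanager",
    "Unable to connect to the host, wipe of swraid, partition-table and "
    "filesystem signatures will not be performed: "
    "Cumin execution failed (exit_code=2)",
    "Powered off",
    "[Netbox] Set status to Decommissioning, deleted all non-mgmt IPs, "
    "updated switch interfaces (disabled, removed vlans, etc)",
    "Configured the linked switch interface(s)",
    "Removed from DebMonitor",
    "Removed from Puppet master and PuppetDB",
]


def diff_steps(baseline, other):
    """Return (steps only in other, steps only in baseline), keeping order."""
    added = [step for step in other if step not in baseline]
    missing = [step for step in baseline if step not in other]
    return added, missing


added, missing = diff_steps(first_run, rerun)
for step in added:
    print("+ " + step)
for step in missing:
    print("- " + step)
```

The only differences are the mgmt DNS fallback message and the skipped wipe; every other step appears in both runs. That is consistent with the hosts having already been wiped and powered off by the first run, although the log itself does not state the cause.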

Gehel closed this task as Resolved. Nov 7 2022, 3:49 PM
