T353878 Service implementation for elastic2087-2109

Subject	Repo	Branch	Lines +/-
elasticsearch: remove elastic2090 from psi cluster	operations/puppet	production	+0 -1
elastic: move elastic2088 to insetup	operations/puppet	production	+5 -0
elastic-codfw: Add new master-eligibles	operations/puppet	production	+3 -0
elastic: Bring elastic2107/2108 into service	operations/puppet	production	+1 -6
elastic: add elastic2088-2109 to production role	operations/puppet	production	+2 -28
elastic: move elastic2107 and 2108 back to insetup	operations/puppet	production	+9 -1
elastic: prepare new hosts	operations/puppet	production	+86 -16

Status	Subtype	Assigned	Task
Open		None	T353392 Ensure Elastic stack works on bookworm
Resolved		bking	T353878 Service implementation for elastic2087-2109
Resolved		BTullis	T355830 Hardware error on elastic2094 - Comm Error: Backplane 0.
Resolved		bking	T358882 Decommission elastic2037-2054
Resolved	Request	bking	T313842 Decommission elastic2049.codfw.wmnet
Resolved	Request	Jhancock.wm	T361305 decommission elastic20[37-54].codfw.wmnet

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2096.codfw.wmnet with OS bullseye completed:

elastic2096 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190046_bking_2055329_elastic2096.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2097.codfw.wmnet with OS bullseye completed:

elastic2097 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190050_bking_2055850_elastic2097.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

elastic2094 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

elastic2088 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye completed:

elastic1104 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190224_bking_2077517_elastic1104.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye completed:

elastic1105 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190228_bking_2077928_elastic1105.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye completed:

elastic1106 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190231_bking_2078401_elastic1106.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye executed with errors:

elastic1103 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

elastic2094 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye completed:

elastic1103 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401191420_bking_2209969_elastic1103.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye completed:

elastic1107 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401191433_bking_2210048_elastic1107.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

elastic2088 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

elastic2094 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

elastic2088 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

elastic2094 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

elastic2088 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

bking moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.01.01 - 2024.01.21) board. Jan 22 2024, 2:16 PM

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

elastic2088 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

elastic2088 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- The reimage failed, see the cookbook logs for the details

Gehel edited projects, added Data-Platform-SRE (2024.01.22 - 2024.02.11); removed Data-Platform-SRE (2024.01.01 - 2024.01.21). Jan 23 2024, 1:54 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

elastic2094 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye completed:

elastic2106 (PASS)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- Removed previous downtime on Alertmanager (old OS)
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401242233_ryankemper_1189954_elastic2106.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye executed with errors:

elastic2103 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye executed with errors:

elastic2104 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye executed with errors:

elastic2105 (FAIL)
- Downtimed on Icinga/Alertmanager
- Disabled Puppet
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye completed:

elastic2103 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401242354_ryankemper_1226680_elastic2103.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
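
The reimage reports above all share one shape: a `hostname (PASS)` or `hostname (FAIL)` line followed by the completed steps. A minimal sketch, assuming the task text has been pasted into a string, that tallies outcomes per host so repeated failures (such as elastic2088 and elastic2094) stand out. The function name and the sample text are illustrative only and are not part of the cookbook.

```python
import re
from collections import Counter, defaultdict

# Matches result lines such as "elastic2094 (FAIL)" or "cloudelastic1008 (PASS)".
RESULT_LINE = re.compile(r"^(?P<host>[a-z]+\d+) \((?P<status>PASS|FAIL)\)$")


def tally_reimage_results(log_text):
    """Count PASS/FAIL result lines per host in pasted cookbook output."""
    tallies = defaultdict(Counter)
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            tallies[match["host"]][match["status"]] += 1
    # Convert to plain dicts so the result is easy to print or compare.
    return {host: dict(counts) for host, counts in tallies.items()}


# Illustrative excerpt in the same shape as the reports in this task.
sample = """\
elastic2094 (FAIL)
- Forced PXE for next reboot
elastic1103 (FAIL)
elastic1103 (PASS)
elastic2094 (FAIL)
"""

summary = tally_reimage_results(sample)
for host, counts in sorted(summary.items()):
    print(host, counts)
```

A host with several FAIL entries and no PASS is a candidate for a hardware check rather than another reimage attempt.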

bking moved this task from In Progress to Blocked / Waiting on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board. Jan 26 2024, 6:11 PM

@bking elastic2088 is now ready for the next step.

elastic2094 is still showing an error and needs further investigation.

BTullis closed subtask T355830: Hardware error on elastic2094 - Comm Error: Backplane 0. as Resolved. Feb 6 2024, 3:46 PM

Gehel moved this task from Blocked / Waiting to In Progress on the Data-Platform-SRE (2024.01.22 - 2024.02.11) board. Feb 6 2024, 4:35 PM

Gehel edited projects, added Data-Platform-SRE (2024.02.12 - 2024.03.03); removed Data-Platform-SRE (2024.01.22 - 2024.02.11). Feb 9 2024, 10:45 AM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.02.12 - 2024.03.03) board.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye completed:

cloudelastic1008 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Checked BIOS boot parameters are back to normal
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402091814_bking_3707376_cloudelastic1008.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
- Updated Netbox status planned -> active
- The sre.puppet.sync-netbox-hiera cookbook was run successfully

Gehel edited projects, added Data-Platform-SRE (2024.03.04 - 2024.03.24); removed Data-Platform-SRE (2024.02.12 - 2024.03.03). Mar 1 2024, 4:00 PM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.04 - 2024.03.24) board. Mar 1 2024, 4:21 PM

Change 1007969 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: add elastic2088-2109 to production role

https://gerrit.wikimedia.org/r/1007969

gerritbot added a project: Patch-For-Review. Mar 1 2024, 8:12 PM

Change 1007969 merged by Bking:

[operations/puppet@production] elastic: add elastic2088-2109 to production role

https://gerrit.wikimedia.org/r/1007969

Maintenance_bot removed a project: Patch-For-Review. Mar 4 2024, 3:31 PM

I added elastic2088-2109 to the production roles and ran Puppet. However:

elastic2104, 2105 and 2106 are not reachable via SSH.

elastic2107 is still on bookworm, so we need to reimage it (roll it back), but a Hiera error is preventing the cookbook from running. We probably need to move the host back to insetup and try again.

Change 1008528 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: move elastic2107 back to insetup

https://gerrit.wikimedia.org/r/1008528

gerritbot added a project: Patch-For-Review. Mar 4 2024, 7:46 PM

Change 1008528 merged by Bking:

[operations/puppet@production] elastic: move elastic2107 and 2108 back to insetup

https://gerrit.wikimedia.org/r/1008528

Maintenance_bot removed a project: Patch-For-Review. Mar 4 2024, 9:31 PM

bking mentioned this in T358029: Migrate selected Search Platform alerts from icinga search-platform team to prometheus data-platform team. Mar 21 2024, 4:23 PM

Change #1013395 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: Bring elastic2107/2108 into service

https://gerrit.wikimedia.org/r/1013395

gerritbot added a project: Patch-For-Review. Mar 21 2024, 7:37 PM

Change #1013395 merged by Bking:

[operations/puppet@production] elastic: Bring elastic2107/2108 into service

https://gerrit.wikimedia.org/r/1013395

Change #1013398 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398

Change #1013398 merged by Bking:

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398

Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:35:10Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:37:05Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:03:59Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-21T22:39:19Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
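
The SAL entries above describe a restart that proceeds "3 nodes at a time". The cookbook's own logic is not shown in this task, so the following is only a sketch of the general batching idea: restart a small group, wait for the cluster to recover, then move on. The `restart` and `wait_until_healthy` callables and the node names are hypothetical stand-ins, not the cookbook's API.

```python
from typing import Callable, Iterator, List, Sequence


def in_batches(nodes: Sequence[str], size: int = 3) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` nodes."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(nodes), size):
        yield list(nodes[start:start + size])


def rolling_restart(
    nodes: Sequence[str],
    restart: Callable[[str], None],
    wait_until_healthy: Callable[[], None],
    size: int = 3,
) -> List[List[str]]:
    """Restart nodes batch by batch, waiting for recovery between batches."""
    completed = []
    for batch in in_batches(nodes, size):
        for node in batch:
            restart(node)
        # Hypothetical hook: block here until the cluster has recovered.
        wait_until_healthy()
        completed.append(batch)
    return completed


# Illustrative run with placeholder node names and no-op actions.
events = []
example_nodes = [f"node{i}" for i in range(1, 8)]
batches_done = rolling_restart(
    example_nodes,
    restart=lambda node: events.append(("restart", node)),
    wait_until_healthy=lambda: events.append(("wait", None)),
)
print([len(batch) for batch in batches_done])
```

Keeping the batch small limits how many shard copies are offline at once, which is the usual reason for restarting only a few nodes at a time.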

Gehel edited projects, added Data-Platform-SRE (2024.03.25 - 2024.04.14); removed Data-Platform-SRE (2024.03.04 - 2024.03.24). Mar 22 2024, 8:45 AM

Gehel moved this task from Backlog to In Progress on the Data-Platform-SRE (2024.03.25 - 2024.04.14) board. Mar 22 2024, 8:45 AM

bking closed subtask T358882: Decommission elastic2037-2054 as Invalid. Mar 26 2024, 4:50 PM

bking reopened subtask T358882: Decommission elastic2037-2054 as In Progress.

elastic2088 is unreachable and is reported as missing from PuppetDB by the Netbox report. No host should be powered on with Puppet disabled or not working for a long period of time. Please either reimage it, or shut it down now and reimage it at a later stage (before powering it on).

In T353878#9664756, @Volans wrote:

elastic2088 is unreachable and is reported as missing from PuppetDB by the Netbox report. No host should be powered on with Puppet disabled or not working for a long period of time. Please either reimage it, or shut it down now and reimage it at a later stage (before powering it on).

I think it wasn't logged to this ticket, but we tried kicking off a reimage of elastic2088 yesterday. From SAL:

END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host elastic2088.codfw.wmnet with OS bullseye

Will likely need to open a ticket with dc-ops. For now, I've powered it off through the DRAC via serveraction powerdown.

Mentioned in SAL (#wikimedia-operations) [2024-03-28T19:48:56Z] <ryankemper> T353878 Updated cross cluster remote seed conf with latest master info: ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst
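
The contents of `push_cross_cluster_conf.py` are not shown in this task. As a rough illustration of what updating remote seeds involves, Elasticsearch accepts cross-cluster seed nodes through the cluster settings API under `cluster.remote.<alias>.seeds`. The sketch below only builds such a request body. The helper name, the seed addresses, and the choice of `persistent` over `transient` settings are assumptions, and the actual script may work differently.

```python
import json


def build_remote_seed_settings(seeds_by_cluster):
    """Build a cluster-settings body that sets remote seeds per cluster alias.

    `seeds_by_cluster` maps an alias (for example "chi") to a list of
    "host:port" transport addresses of that cluster's master-eligible nodes.
    """
    return {
        "persistent": {  # assumption: persistent rather than transient
            "cluster": {
                "remote": {
                    alias: {"seeds": list(seeds)}
                    for alias, seeds in seeds_by_cluster.items()
                }
            }
        }
    }


# Placeholder addresses; real seed lists would come from files such as
# the *_masters.lst inputs named in the SAL entry.
example_body = build_remote_seed_settings(
    {
        "chi": ["master-a.example:9300", "master-b.example:9300"],
        "psi": ["master-c.example:9300"],
    }
)
print(json.dumps(example_body, indent=2))
```

The resulting JSON would be sent with a PUT request to the `_cluster/settings` endpoint, which is the URL form that appears in the logged command.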

Change #1015379 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: remove elastic2090 from psi cluster

https://gerrit.wikimedia.org/r/1015379

Change #1015381 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: move elastic2088 to insetup

https://gerrit.wikimedia.org/r/1015381

Change #1015381 merged by Bking:

[operations/puppet@production] elastic: move elastic2088 to insetup

https://gerrit.wikimedia.org/r/1015381

Change #1015379 merged by Bking:

[operations/puppet@production] elasticsearch: remove elastic2090 from psi cluster

https://gerrit.wikimedia.org/r/1015379

Mentioned in SAL (#wikimedia-operations) [2024-03-28T20:07:47Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-28T20:07:53Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878
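
How the `sre.elasticsearch.ban` cookbook implements a ban is not shown in this task. One common way to drain a node before maintenance is Elasticsearch's shard allocation filtering, which moves shards away from nodes whose name matches an exclusion pattern. The sketch below builds request bodies for that setting only. The helper names and the use of `transient` settings are assumptions.

```python
import json

EXCLUDE_BY_NAME = "cluster.routing.allocation.exclude._name"


def build_ban_settings(node_patterns):
    """Build a cluster-settings body excluding matching node names.

    Patterns may contain wildcards, e.g. "elastic2090*"; several patterns
    are joined into one comma-separated value.
    """
    return {"transient": {EXCLUDE_BY_NAME: ",".join(node_patterns)}}


def build_unban_settings():
    """Build a body that resets the exclusion (a null value clears a setting)."""
    return {"transient": {EXCLUDE_BY_NAME: None}}


ban_body = build_ban_settings(["elastic2090*"])
print(json.dumps(ban_body))
print(json.dumps(build_unban_settings()))
```

Either body would be sent with a PUT request to `_cluster/settings`; once the exclusion is cleared, the cluster is free to allocate shards to the node again.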

RKemper closed subtask T358882: Decommission elastic2037-2054 as Resolved. Tue, Apr 2, 6:11 AM

All hosts in scope for implementation are now part of our production elastic cluster, except elastic2088, which has hardware problems (tracked in T361525). Closing.

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:50:06Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:50:10Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:51:44Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:51:59Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878

	bking
	Dec 21 2023, 2:07 PM
