Service implementation for elastic2087-2109
Closed, Resolved (Public)

Description

Creating this ticket to:

  • Bring hosts elastic2087-2109 into service: 5 net-new hosts, 18 refresh
  • Decom elastic20[37-54]
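As a quick sanity check on the two host ranges above, the following minimal sketch expands both notations used in this ticket and compares the counts. The helper name `expand_hosts` is hypothetical and not part of any existing tooling; it only handles the two range styles that appear in the description.

```python
import re

def expand_hosts(expr):
    """Expand 'elastic20[37-54]' or 'elastic2087-2109' into a list of hostnames.

    Hypothetical helper for illustration only; it supports just the two
    range notations used in the ticket description.
    """
    match = (re.fullmatch(r"(.*)\[(\d+)-(\d+)\]", expr)
             or re.fullmatch(r"(.*?)(\d+)-(\d+)", expr))
    if match is None:
        raise ValueError(f"unrecognised host range: {expr!r}")
    prefix, low, high = match.groups()
    width = len(low)  # keep zero padding consistent with the lower bound
    return [f"{prefix}{n:0{width}d}" for n in range(int(low), int(high) + 1)]

new_hosts = expand_hosts("elastic2087-2109")
decom_hosts = expand_hosts("elastic20[37-54]")

# 23 new hosts, 18 to decommission, so 23 - 18 = 5 net-new hosts
print(len(new_hosts), len(decom_hosts), len(new_hosts) - len(decom_hosts))
```

The counts agree with the description: the 18 decommissioned hosts correspond to the 18 refresh hosts, leaving 5 net-new.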

Puppet code to enable Puppet 7 on these new hosts was added here

Event Timeline

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2096.codfw.wmnet with OS bullseye completed:

  • elastic2096 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190046_bking_2055329_elastic2096.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2097.codfw.wmnet with OS bullseye completed:

  • elastic2097 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190050_bking_2055850_elastic2097.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

  • elastic2094 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1104.eqiad.wmnet with OS bullseye completed:

  • elastic1104 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190224_bking_2077517_elastic1104.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1105.eqiad.wmnet with OS bullseye completed:

  • elastic1105 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190228_bking_2077928_elastic1105.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1106.eqiad.wmnet with OS bullseye completed:

  • elastic1106 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401190231_bking_2078401_elastic1106.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye executed with errors:

  • elastic1103 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

  • elastic2094 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1103.eqiad.wmnet with OS bullseye completed:

  • elastic1103 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401191420_bking_2209969_elastic1103.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic1107.eqiad.wmnet with OS bullseye completed:

  • elastic1107 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401191433_bking_2210048_elastic1107.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Cleared switch DHCP cache and MAC table for the host IP and MAC (EVPN Switch)

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

  • elastic2094 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

  • elastic2094 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2088.codfw.wmnet with OS bullseye executed with errors:

  • elastic2088 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host elastic2094.codfw.wmnet with OS bullseye executed with errors:

  • elastic2094 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2106.codfw.wmnet with OS bullseye completed:

  • elastic2106 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401242233_ryankemper_1189954_elastic2106.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye executed with errors:

  • elastic2103 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2104.codfw.wmnet with OS bullseye executed with errors:

  • elastic2104 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2105.codfw.wmnet with OS bullseye executed with errors:

  • elastic2105 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin2002 for host elastic2103.codfw.wmnet with OS bullseye completed:

  • elastic2103 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202401242354_ryankemper_1226680_elastic2103.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
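Because several hosts were reimaged more than once, it can help to tally the cookbook results per host. The sketch below is only an illustration: `tally` and `unresolved` are hypothetical helper names, and `SAMPLE` is a short excerpt of result lines copied from this timeline, not the full log.

```python
import re
from collections import defaultdict

# Matches result lines such as "  • elastic2096 (PASS)"
RESULT_RE = re.compile(r"•\s+(\S+) \((PASS|FAIL)\)")

def tally(lines):
    """Return {host: [result, ...]} in the order the results were logged."""
    attempts = defaultdict(list)
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            attempts[match.group(1)].append(match.group(2))
    return dict(attempts)

def unresolved(attempts):
    """Hosts whose most recent logged attempt did not pass."""
    return sorted(host for host, results in attempts.items()
                  if results[-1] != "PASS")

# Excerpt only: a few result lines from the timeline, in logged order.
SAMPLE = [
    "  • elastic2096 (PASS)",
    "  • elastic2094 (FAIL)",
    "  • elastic2088 (FAIL)",
    "  • elastic1103 (FAIL)",
    "  • elastic1103 (PASS)",
    "  • elastic2103 (FAIL)",
    "  • elastic2103 (PASS)",
]

attempts = tally(SAMPLE)
print(unresolved(attempts))  # hosts in the excerpt still needing attention
```

Run against the excerpt, this lists elastic2088 and elastic2094, the two hosts that needed repeated attempts above.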

@bking elastic2088 is now ready for the next step.

elastic2094 is still showing an error and needs further investigation.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.eqiad.wmnet with OS bullseye completed:

  • cloudelastic1008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202402091814_bking_3707376_cloudelastic1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully

Change 1007969 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: add elastic2088-2109 to production role

https://gerrit.wikimedia.org/r/1007969

Change 1007969 merged by Bking:

[operations/puppet@production] elastic: add elastic2088-2109 to production role

https://gerrit.wikimedia.org/r/1007969

I added elastic2088-2109 to the production roles and ran Puppet. However:

  • elastic2104, 2105 and 2106 are not reachable via SSH.

Change 1008528 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: move elastic2107 back to insetup

https://gerrit.wikimedia.org/r/1008528

Change 1008528 merged by Bking:

[operations/puppet@production] elastic: move elastic2107 and 2108 back to insetup

https://gerrit.wikimedia.org/r/1008528

Change #1013395 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic: Bring elastic2107/2108 into service

https://gerrit.wikimedia.org/r/1013395

Change #1013395 merged by Bking:

[operations/puppet@production] elastic: Bring elastic2107/2108 into service

https://gerrit.wikimedia.org/r/1013395

Change #1013398 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398

Change #1013398 merged by Bking:

[operations/puppet@production] elastic-codfw: Add new master-eligibles

https://gerrit.wikimedia.org/r/1013398

Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:35:10Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-21T20:37:05Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:03:59Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-21T22:39:19Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: introduce new masters - bking@cumin2002 - T353878
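The START/END pairs in the SAL entries above can be turned into run durations. This is a minimal sketch, not an existing tool: `run_durations` is a hypothetical helper, the regular expression is an assumption based only on the entries shown in this ticket, and the sample lines are abbreviated copies of those entries.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the SAL entries quoted in this ticket.
SAL_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] <(?P<who>[^>]+)> "
    r"(?P<phase>START|END \((?:PASS|FAIL)\)) - Cookbook (?P<cookbook>\S+)"
)

def run_durations(lines):
    """Pair each START with the next END and return (outcome, duration)."""
    runs, started = [], None
    for line in lines:
        match = SAL_RE.search(line)
        if not match:
            continue
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%SZ")
        if match["phase"] == "START":
            started = ts
        elif started is not None:
            runs.append((match["phase"], ts - started))
            started = None
    return runs

# Abbreviated copies of the four rolling-operation entries above.
entries = [
    "[2024-03-21T20:35:10Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation",
    "[2024-03-21T20:37:05Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99)",
    "[2024-03-21T21:03:59Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation",
    "[2024-03-21T22:39:19Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0)",
]

runs = run_durations(entries)
for outcome, duration in runs:
    print(outcome, duration)
```

For these entries, the first attempt failed after 1 minute 55 seconds, and the successful rolling restart took 1 hour 35 minutes 20 seconds.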

elastic2088 is unreachable and reported as missing from PuppetDB by the Netbox report. No host should be powered on with Puppet disabled or not working for a long period of time. Please either reimage it, or shut it down now and reimage it at a later stage (before powering it on).

I think it wasn't logged to this ticket, but we tried kicking off a reimage of elastic2088 yesterday. From SAL:

END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host elastic2088.codfw.wmnet with OS bullseye

Will likely need to open a ticket with dc-ops. For now, I've powered it off through the DRAC via serveraction powerdown.

Mentioned in SAL (#wikimedia-operations) [2024-03-28T19:48:56Z] <ryankemper> T353878 Updated cross cluster remote seed conf with latest master info: ryankemper@mwmaint1002:~/elastic$ python push_cross_cluster_conf.py https://search.svc.codfw.wmnet:9443/_cluster/settings --ccc chi=chi_codfw_masters.lst psi=psi_codfw_masters.lst omega=omega_codfw_masters.lst

Change #1015379 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: remove elastic2090 from psi cluster

https://gerrit.wikimedia.org/r/1015379

Change #1015381 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] elastic: move elastic2088 to insetup

https://gerrit.wikimedia.org/r/1015381

Change #1015381 merged by Bking:

[operations/puppet@production] elastic: move elastic2088 to insetup

https://gerrit.wikimedia.org/r/1015381

Change #1015379 merged by Bking:

[operations/puppet@production] elasticsearch: remove elastic2090 from psi cluster

https://gerrit.wikimedia.org/r/1015379

Mentioned in SAL (#wikimedia-operations) [2024-03-28T20:07:47Z] <ryankemper@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-03-28T20:07:53Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2090* for ban elastic2090 before reimage - ryankemper@cumin2002 - T353878

All hosts in scope for implementation are now part of our production elastic cluster, except elastic2088, which has hardware problems (tracked in T361525). Closing.

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:50:06Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:50:10Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic2090 for reboot to get rid of broken systemd units - bking@cumin2002 - T353878

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:51:44Z] <bking@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878

Mentioned in SAL (#wikimedia-operations) [2024-04-12T15:51:59Z] <bking@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on elastic2090.codfw.wmnet with reason: T353878