Replace current Relforge servers with repurposed Elastic hosts
Closed, Resolved (Public)

Description

Per parent ticket, we've run into some issues with the Relforge hosts (mostly due to their age):

  • The current relforge hosts cannot be reimaged via cookbook because they are HP chassis (WMF hasn't bought them for years). Based on my work yesterday:
    • A manual reimage adds about 2 hours per server.
    • It requires some delicate commands on the puppet server that I'd rather not run.
  • They are on a 1G network, and we have to shuffle 1.1 TB around every time we reimage a host.
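For a rough sense of what the 1G link costs, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes), a single link running at full line rate, and no protocol overhead, so it is a lower bound rather than a measurement. The function name and the `efficiency` parameter are made up for illustration.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link_gbps link.

    efficiency is a hypothetical fudge factor (0 < efficiency <= 1) for
    protocol overhead and contention; 1.0 means ideal line rate.
    """
    bits = size_tb * 1e12 * 8                      # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# 1.1 TB over an ideal 1 Gbit/s link
print(f"{transfer_hours(1.1, 1.0):.2f} hours")     # 2.44 hours
```

Real shard recovery will be slower than this, since throughput rarely reaches line rate and recovery is usually throttled.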

The OpenSearch migration is a risky endeavor. We need the freedom to reimage the relforge cluster multiple times if necessary, so we can make sure the process is repeatable before we move on to the production clusters. As such, I've elected to repurpose elastic1104-1106 as Relforge hosts.

The current relforge hosts are already slated to be replaced in T382906, so this just moves that timetable up a bit. We have plenty of capacity in eqiad, so losing 3 hosts is not a huge deal; they will be backfilled next quarter anyway.

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2025-02-13T14:16:16Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104*,elastic1005*,elastic1006* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357

Mentioned in SAL (#wikimedia-operations) [2025-02-13T14:16:19Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1104*,elastic1005*,elastic1006* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357

Change #1119520 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge/elastic: repurpose elastic hosts for relforge

https://gerrit.wikimedia.org/r/1119520

Mentioned in SAL (#wikimedia-operations) [2025-02-13T15:03:53Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1104*,elastic1105*,elastic1106* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357

Mentioned in SAL (#wikimedia-operations) [2025-02-13T15:03:56Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1104*,elastic1105*,elastic1106* for ban hosts prior to reimage/repurpose - bking@cumin2002 - T386357
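The first ban run (14:16Z) lists elastic1005* and elastic1006*, while the second run (15:03Z) lists elastic1105* and elastic1106*, which matches the hosts named in the description. A pre-flight check along the following lines could catch that kind of mismatch before a ban is issued. This is only a sketch: it assumes the patterns behave like shell-style globs against bare hostnames, which this ticket does not confirm, and `unmatched_hosts` is a hypothetical helper, not part of the cookbook.

```python
from fnmatch import fnmatch


def unmatched_hosts(intended, patterns):
    """Return the intended hosts that none of the ban patterns match."""
    return [h for h in intended if not any(fnmatch(h, p) for p in patterns)]


intended = ["elastic1104", "elastic1105", "elastic1106"]
first_ban = ["elastic1104*", "elastic1005*", "elastic1006*"]
second_ban = ["elastic1104*", "elastic1105*", "elastic1106*"]

print(unmatched_hosts(intended, first_ban))    # ['elastic1105', 'elastic1106']
print(unmatched_hosts(intended, second_ban))   # []
```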

Mentioned in SAL (#wikimedia-operations) [2025-02-13T15:19:32Z] <bking@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic[1104-1106].eqiad.wmnet with reason: T386357

Change #1119520 merged by Bking:

[operations/puppet@production] relforge/elastic: repurpose elastic hosts for relforge

https://gerrit.wikimedia.org/r/1119520

Cookbook cookbooks.sre.hosts.rename started by bking@cumin2002 from elastic1104 to relforge1005 completed:

  • elastic1104 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by bking@cumin2002 from elastic1105 to relforge1006 completed:

  • elastic1105 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.rename started by bking@cumin2002 from elastic1106 to relforge1007 completed:

  • elastic1106 (PASS)
    • ✔️ Downtimed host on Icinga/Alertmanager
    • ✔️ Disabled puppet and its timer
    • ✔️ Disabled debmonitor-client timer
    • ✔️ Netbox updated
    • ✔️ BMC Hostname updated
    • ✔️ DNS updated
    • ✔️ Switch description updated
    • ✔️ Removed from DebMonitor
    • ✔️ Removed from Puppet master and PuppetDB
    • Rename completed 👍 - now please run the re-image cookbook on the new name with --new

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1005.eqiad.wmnet with OS bullseye executed with errors:

  • relforge1005 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console relforge1005.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1006.eqiad.wmnet with OS bullseye executed with errors:

  • relforge1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console relforge1006.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1007.eqiad.wmnet with OS bullseye executed with errors:

  • relforge1007 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console relforge1007.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1005.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1005.eqiad.wmnet with OS bullseye completed:

  • relforge1005 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131803_bking_4109053_relforge1005.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1007.eqiad.wmnet with OS bullseye executed with errors:

  • relforge1007 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console relforge1007.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1006.eqiad.wmnet with OS bullseye executed with errors:

  • relforge1006 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • The reimage failed, see the cookbook logs for the details. You can also try typing "sudo install-console relforge1006.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1007.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host relforge1006.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1007.eqiad.wmnet with OS bullseye completed:

  • relforge1007 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131858_bking_4137164_relforge1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host relforge1006.eqiad.wmnet with OS bullseye completed:

  • relforge1006 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202502131900_bking_4137302_relforge1006.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1119576 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] relforge: Prepare newly-reimaged relforge hosts to join the cluster

https://gerrit.wikimedia.org/r/1119576

Change #1119576 merged by Bking:

[operations/puppet@production] relforge: Prepare newly-reimaged relforge hosts to join the cluster

https://gerrit.wikimedia.org/r/1119576

Mentioned in SAL (#wikimedia-operations) [2025-02-13T22:44:39Z] <bking@cumin2002> DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on relforge[1003-1007].eqiad.wmnet with reason: T386357

I finished reimaging the above hosts, and they are now part of the cluster:

bking@relforge1005:~$ curl -s http://0:9200/_cat/nodes | sort -k10
10.64.5.37   42 99 1 0.58 0.73 0.64 dimr * relforge1003-relforge-eqiad
10.64.130.24  2 30 1 0.33 0.56 0.54 dimr - relforge1005-relforge-eqiad
10.64.152.2   3 31 1 0.65 0.60 0.46 dimr - relforge1006-relforge-eqiad
10.64.134.22  7 31 0 0.61 0.63 0.66 dimr - relforge1007-relforge-eqiad
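If this needs to be repeated across several reimage rounds, the membership check could be scripted. The sketch below parses the output above and reports any expected host that is missing. It assumes the ten-column layout shown here (ip, heap, ram, cpu, three load averages, roles, master marker, name) and that node names start with the hostname followed by a hyphen. Both function names are illustrative.

```python
def parse_cat_nodes(text):
    """Parse whitespace-separated _cat/nodes output into a list of dicts."""
    nodes = []
    for line in text.strip().splitlines():
        fields = line.split()
        nodes.append({
            "ip": fields[0],
            "roles": fields[7],
            "master": fields[8] == "*",   # "*" marks the elected master
            "name": fields[9],
        })
    return nodes


def missing_hosts(nodes, expected):
    """Return expected hostnames that have no matching node name."""
    names = [n["name"] for n in nodes]
    return [h for h in expected
            if not any(name.startswith(h + "-") for name in names)]


sample = """
10.64.5.37   42 99 1 0.58 0.73 0.64 dimr * relforge1003-relforge-eqiad
10.64.130.24  2 30 1 0.33 0.56 0.54 dimr - relforge1005-relforge-eqiad
10.64.152.2   3 31 1 0.65 0.60 0.46 dimr - relforge1006-relforge-eqiad
10.64.134.22  7 31 0 0.61 0.63 0.66 dimr - relforge1007-relforge-eqiad
"""

nodes = parse_cat_nodes(sample)
print(missing_hosts(nodes, ["relforge1005", "relforge1006", "relforge1007"]))  # []
```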

As such, I'm closing out this ticket.