Service implementation for cloudelastic1007-1010
Open, High, Public

Description

cloudelastic1007-1010 are racked and ready to join the cluster. See Search Platform docs for the procedure (which might be outdated and need review).

AC

  • cloudelastic10[07-10] brought into service
  • decom cloudelastic100[1-4]
    • NOTE: Per hieradata/role/eqiad/elasticsearch/cloudelastic.yaml, 1001, 1002, and 1004 are the current masters, so we'll need to switch these entries to the new hosts
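
The bracketed host ranges in the acceptance criteria can be expanded into explicit hostnames. The following is a minimal sketch, not the tooling used on the cluster: `expand_hosts` is a hypothetical helper that handles a single numeric range per pattern, and `current_masters` is built only from the three hosts named in the note above.

```python
import re


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracketed numeric range, e.g. 'cloudelastic10[07-10]'."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", pattern)
    if match is None:
        return [pattern]  # no range present: treat as a single host
    prefix, start, end, suffix = match.groups()
    width = len(start)  # preserve zero padding such as '07'
    return [
        f"{prefix}{n:0{width}d}{suffix}"
        for n in range(int(start), int(end) + 1)
    ]


new_hosts = expand_hosts("cloudelastic10[07-10]")
old_hosts = expand_hosts("cloudelastic100[1-4]")

# Masters named in the note above; these entries must move to new hosts
# before the old hosts are decommissioned.
current_masters = {"cloudelastic1001", "cloudelastic1002", "cloudelastic1004"}
masters_to_replace = [h for h in old_hosts if h in current_masters]

print("bring into service:", new_hosts)
print("decommission:", old_hosts)
print("masters to replace first:", masters_to_replace)
```

Listing the masters separately makes the ordering constraint explicit: the master entries have to be switched before the old hosts can be removed.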

Event Timeline

Change 974693 had a related patch set uploaded (by Bking; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/974693

Change 974693 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/974693

Change 974694 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: hosts need racking info

https://gerrit.wikimedia.org/r/974694

Change 974694 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: hosts need racking info

https://gerrit.wikimedia.org/r/974694

Change 974696 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: switch new hosts back to insetup

https://gerrit.wikimedia.org/r/974696

Change 974696 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: switch new hosts back to insetup

https://gerrit.wikimedia.org/r/974696

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • No changes in confctl are needed to restore the previous state.
  • The reimage failed, see the cookbook logs for the details
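
The confctl lines in the report above are one JSON object per line, so the pooled state per service can be checked mechanically. This is a rough sketch under that assumption: `pooled_by_service` is a hypothetical helper, and the input is copied from the cloudelastic1008 report rather than fetched from confctl.

```python
import json

# Copied verbatim from the cookbook report for cloudelastic1008 above.
CONFCTL_LINES = """
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}
"""


def pooled_by_service(lines: str, host: str) -> dict[str, str]:
    """Map each service name to the host's pooled state."""
    result = {}
    for line in lines.strip().splitlines():
        record = json.loads(line)
        # "tags" is a comma-separated list of key=value pairs.
        tags = dict(pair.split("=", 1) for pair in record["tags"].split(","))
        result[tags["service"]] = record[host]["pooled"]
    return result


states = pooled_by_service(CONFCTL_LINES, "cloudelastic1008.wikimedia.org")
for service, pooled in states.items():
    print(f"{service}: {pooled}")
```

All six services report the same state here, which matches the report's note that no confctl changes are needed to restore the previous state.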

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • No changes in confctl are needed to restore the previous state.
  • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye completed:

  • cloudelastic1007 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161027_jbond_1748240_cloudelastic1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye completed:

  • cloudelastic1008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161521_bking_2651612_cloudelastic1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@bking I took a look at cloudelastic1010, as I had thought it was in some broken state from the reimage cookbook. However, from the Puppet certs I can see it's been around since Nov 9 07:30:40 2023 GMT and has had Puppet disabled for the last 36 hours.

As a side note, you shouldn't need to disable Puppet when a server has the insetup role, and it's bad to do so.

@jbond Sorry for the confusion, I associated the reimage with the wrong ticket. The output of the last reimage is here. Puppet was disabled because the hosts were previously set to their production role, but due to the PKI errors we put them back to insetup. I should have paid more attention... it seems the reimage never actually wiped the disks, whereas I had assumed it failed on later steps.

As for what led to this situation, I'll try to recount it in the hope that it's useful:

  • DC Ops did their typical host setup for cloudelastic1008-cloudelastic1010 in this ticket. We'll ignore 1007, because I was using it to fine-tune a new partman recipe and thus it was already working.
  • I noticed 1008-1010 were not accessible via SSH. The DRAC console showed a blank screen.
  • For each host, I power-cycled, logged in via the console with the root password, and ran Puppet. This restored SSH connectivity. However, any subsequent Puppet runs led to PKI errors.
  • I reimaged the hosts a few times after that (using the wrong ticket linked above), with the same results. Eventually I tried using the --new flag for the reimage and was prompted to select a Puppet version. Selecting Puppet 7 allowed the reimages to complete successfully.

This isn't a blocker to our work, so don't feel like you have to dig in too deeply. I've left 1010 up in hopes that it might be useful. If it isn't, ping me and I'll reimage again.

@bking In order to investigate further, I need either a broken host to examine or a way to replicate the issue.

Change 975824 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: force Puppet 7 for cloudelastic1010

https://gerrit.wikimedia.org/r/975824

Change 975824 merged by Bking:

[operations/puppet@production] cloudelastic: force Puppet 7 for cloudelastic1010

https://gerrit.wikimedia.org/r/975824

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye completed:

  • cloudelastic1010 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311201728_bking_1282704_cloudelastic1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
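
The reimage reports in this timeline can be reduced to the latest outcome per host. The sketch below is illustrative only: the summary lines are copied in order from the reports above, and `latest_outcomes` is a hypothetical helper, not part of the cookbook.

```python
import re

# Host summary lines from the cookbook reports above, in timeline order.
SUMMARY_LINES = [
    "• cloudelastic1008 (FAIL)",
    "• cloudelastic1007 (FAIL)",
    "• cloudelastic1007 (WARN)",
    "• cloudelastic1008 (PASS)",
    "• cloudelastic1010 (FAIL)",
    "• cloudelastic1010 (PASS)",
]


def latest_outcomes(lines: list[str]) -> dict[str, str]:
    """Return the most recent PASS/WARN/FAIL status seen for each host."""
    latest = {}
    for line in lines:
        match = re.search(r"(\S+) \((PASS|WARN|FAIL)\)", line)
        if match:
            # Later lines overwrite earlier ones for the same host.
            latest[match.group(1)] = match.group(2)
    return latest


outcomes = latest_outcomes(SUMMARY_LINES)
for host, status in sorted(outcomes.items()):
    print(f"{host}: {status}")
```

No reimage of cloudelastic1009 is logged on this task, so it does not appear in the result; per the comments above, some reimages were recorded against a different ticket.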

Gehel triaged this task as High priority. Wed, Nov 22, 9:24 AM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.