
Service implementation for cloudelastic1007-1010
Closed, Resolved · Public

Description

cloudelastic1007-1010 are racked and ready to join the cluster. See the Search Platform docs for the procedure (which may be outdated and need review).

AC

  • cloudelastic10[07-10] brought into service
  • decom cloudelastic100[1-4]: that work has been moved to T357780
    • NOTE: Per hieradata/role/eqiad/elasticsearch/cloudelastic.yaml, 1001, 1002, and 1004 are the current masters, so we'll need to switch these entries to the new hosts (a quick way to check the current master-eligible nodes is sketched below this list)
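
For reference, a quick way to see which hosts are currently master-eligible is the Elasticsearch cat API. This is only a sketch: the endpoint and port below are assumptions and should be swapped for the real cloudelastic endpoint.

  # List node names and roles; master-eligible nodes show "m" in node.role,
  # and the elected master carries "*" in the master column.
  # Hostname/port are illustrative; substitute the actual cloudelastic endpoint.
  curl -s 'https://cloudelastic.wikimedia.org:8243/_cat/nodes?v&h=name,node.role,master'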

Event Timeline


Change 974693 had a related patch set uploaded (by Bking; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/974693

Change 974693 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/974693

Change 974694 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: hosts need racking info

https://gerrit.wikimedia.org/r/974694

Change 974694 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: hosts need racking info

https://gerrit.wikimedia.org/r/974694

Change 974696 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: switch new hosts back to insetup

https://gerrit.wikimedia.org/r/974696

Change 974696 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: switch new hosts back to insetup

https://gerrit.wikimedia.org/r/974696

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • No changes in confctl are needed to restore the previous state.
  • The reimage failed, see the cookbook logs for the details
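
For context on the confctl lines above: the pooled state can be inspected and, if it ever did need restoring by hand, set again with conftool once the host is healthy. A hedged sketch; the selectors and weight are illustrative and should be checked against confctl's help output.

  # Show the current pooled state for this host across all cloudelastic services
  sudo confctl select 'name=cloudelastic1008.wikimedia.org' get
  # Repool one service once the host is healthy (weight value is illustrative)
  sudo confctl select 'name=cloudelastic1008.wikimedia.org,service=cloudelastic-chi-ssl' set/pooled=yes:weight=10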

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • No changes in confctl are needed to restore the previous state.
  • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye completed:

  • cloudelastic1007 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161027_jbond_1748240_cloudelastic1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye completed:

  • cloudelastic1008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161521_bking_2651612_cloudelastic1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@bking I took a look at cloudelastic1010, as I had thought it was in some broken state from the reimage cookbook. However, from the Puppet certs I can see it's been around since Nov 9 07:30:40 2023 GMT and has had Puppet disabled for the last 36 hours.

As a side note, you shouldn't need to disable Puppet when a server has the insetup role, and it's bad practice to do so.

@jbond Sorry for the confusion, I associated the reimage with the wrong ticket. The output of the last reimage is here. Puppet was disabled because the hosts were previously set to their production role, but due to the PKI errors we put them back to insetup. I should have paid more attention... it seems the reimage never actually wiped the disks, whereas I had assumed it failed on later steps.

As far as what led to this situation goes, I'll try to recount it in the hopes that it's useful:

  • DC Ops did their typical host setup for cloudelastic1008-cloudelastic1010 in this ticket. We'll ignore 1007, because I was using it to fine-tune a new partman recipe and thus it was already working.
  • I noticed 1008-1010 were not accessible via SSH. The DRAC console showed a blank screen.
  • For each host, I power-cycled it, logged in via the console with the root password, and ran Puppet. This restored SSH connectivity. However, any subsequent Puppet runs led to PKI errors.
  • I reimaged the hosts a few times after that (using the wrong ticket linked above), with the same results. Eventually I tried the --new flag for the reimage and was prompted to select a Puppet version. Selecting Puppet 7 allowed the reimages to complete successfully (see the example invocation just below this list).
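
For the record, the invocation that finally worked was along these lines. This is a best-effort recollection, not a verified command; flag names and the host argument should be double-checked against the cookbook's help output.

  # Reimage the box as a brand-new host so the cookbook prompts for a Puppet version.
  # Flags are best-effort; verify with: sudo cookbook sre.hosts.reimage --help
  sudo cookbook sre.hosts.reimage --new --os bullseye -t T351354 cloudelastic1010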

This isn't a blocker to our work, so don't feel like you have to dig in too deeply. I've left 1010 up in hopes that it might be useful. If it isn't, ping me and I'll reimage again.

@bking In order for me to investigate further, I need either a broken host to examine or a way to replicate the issue.

Change 975824 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: force Puppet 7 for cloudelastic1010

https://gerrit.wikimedia.org/r/975824

Change 975824 merged by Bking:

[operations/puppet@production] cloudelastic: force Puppet 7 for cloudelastic1010

https://gerrit.wikimedia.org/r/975824
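
A quick way to confirm which Puppet agent version a host ends up on after a change like this (the cumin query below is illustrative):

  # Check the agent version on the affected host
  sudo cumin 'cloudelastic1010.wikimedia.org' 'puppet --version'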

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye completed:

  • cloudelastic1010 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311201728_bking_1282704_cloudelastic1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Gehel triaged this task as High priority. Nov 22 2023, 9:24 AM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.
bking updated Other Assignee, added: RKemper.
bking removed a subscriber: jbond.

Change 991788 had a related patch set uploaded (by Bking; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/991788

Change 991788 merged by Bking:

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/991788

Change 991797 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: allow new hosts to request TLS certs

https://gerrit.wikimedia.org/r/991797

Change 991797 merged by Bking:

[operations/puppet@production] cloudelastic: allow new hosts to request TLS certs

https://gerrit.wikimedia.org/r/991797

Change 991845 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: cleanup allowed_regexes

https://gerrit.wikimedia.org/r/991845

Change 991845 merged by Bking:

[operations/puppet@production] cloudelastic: cleanup allowed_regexes

https://gerrit.wikimedia.org/r/991845
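
To sanity-check that the new hosts present the expected certificate after these changes, an openssl probe along these lines works; the port and SNI name are assumptions and should be replaced with the real per-cluster TLS values.

  # Print subject, issuer and validity of the certificate served by a new host
  # (port and -servername are illustrative)
  echo | openssl s_client -connect cloudelastic1007.wikimedia.org:9243 \
      -servername cloudelastic.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -subject -issuer -dates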

We got a diffscan alert as those servers are running on public IPs and new ports are exposed to the diffscan cloudVM.

After a quick look, it seems those servers already expose their endpoints through LVS (cloudelastic.wikimedia.org), so I'm wondering why they can't be in the private VLANs. If there are good reasons, could they be documented somewhere? If not, could the hosts be re-numbered to private IPs?
See https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#Public_IPs

@ayounsi Thanks for the link. We're in the process of rolling out new hosts, and unfortunately we reused the existing Puppet code without much thought about public IPs. What is the urgency of this request, and how long do you think it would take to re-IP these servers? If there are any docs on how to do this, let us know.

@taavi Indeed, I was thinking of that one too. I'll post an update there.

What is the urgency of this request?

Without sounding alarmist: if they don't need public IPs, the move should be done now, to avoid having to handle them for the next 5 years. Public-IP hosts are quite a pain, for the reasons listed on the wiki page.

and how long do you think it would take to re-IP these servers? If there are any docs on how to do this, let us know.

That's quite straightforward: outside of the reimage scripts, I'd say about 15 minutes per server.
The procedure is at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs and I can walk you through it, no problem.

From the timeline and my understanding of the traffic flows and the service owner, it seems the hosts are better suited to the prod private VLAN than to cloud-private, but I'm happy to discuss it.


Likewise, I'm happy to assist here if needed. The process is a little clunky, but not too tricky if they are new servers not yet live.

It'd be a real shame to bring a bunch of new servers live on the public VLAN, using up those IPs for the next few years when we don't need to.

Change 992538 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: promote new hosts to master-eligible

https://gerrit.wikimedia.org/r/992538

Change 992538 merged by Bking:

[operations/puppet@production] cloudelastic: promote new hosts to master-eligible

https://gerrit.wikimedia.org/r/992538

Change 993038 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: remove old masters

https://gerrit.wikimedia.org/r/993038

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:08:57Z] <ryankemper> T351354 Downtimed cloudelastic*; shortly will restart cloudelastic100[1,2,4] one host at a time to make them no longer masters

Change 993038 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: remove old masters

https://gerrit.wikimedia.org/r/993038

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:15:50Z] <ryankemper> T351354 Restarting cloudelastic1004 following puppet run

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:25:58Z] <ryankemper> T351354 Restarting cloudelastic1002

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:33:19Z] <ryankemper> T351354 Now restarting new masters to keep configs in sync; restarting cloudelastic1007

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:34:42Z] <ryankemper> T351354 Now restarting new masters to keep configs in sync; restarting cloudelastic1009

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:40:06Z] <ryankemper> T351354 Restarting cloudelastic1006 (final restart for today)

The old masters are no longer master-eligible. They're still participating in the cluster; we're holding off on the physical decom until T355617 is done.
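
To verify the end state, the cat APIs can confirm which node is the elected master and which nodes remain master-eligible; as above, the endpoint and port are assumptions.

  # Show the currently elected master
  curl -s 'https://cloudelastic.wikimedia.org:8243/_cat/master?v'
  # Show all nodes with their roles; only the new hosts should carry "m"
  curl -s 'https://cloudelastic.wikimedia.org:8243/_cat/nodes?v&h=name,node.role'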

cloudelastic10[07-10] are now in service (most work happened in T355617). Closing.

bking updated the task description.