
Service implementation for cloudelastic1007-1010
Closed, Resolved · Public

Description

cloudelastic1007-1010 are racked and ready to join the cluster. See the Search Platform docs for the procedure (which may be outdated and need review).

AC

  • cloudelastic10[07-10] brought into service
  • decom cloudelastic100[1-4]: that work has been moved to T357780
    • NOTE: Per hieradata/role/eqiad/elasticsearch/cloudelastic.yaml, 1001, 1002, and 1004 are the current masters, so we'll need to switch these entries to the new hosts (a quick way to check the current master-eligible nodes is sketched below this list)
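
For reference, a quick way to see which hosts are currently master-eligible is the Elasticsearch cat API. This is only a sketch: the endpoint and port below are assumptions and should be swapped for the real cloudelastic endpoint.

  # List node names and roles; master-eligible nodes show "m" in node.role,
  # and the elected master carries "*" in the master column.
  # Hostname/port are illustrative; substitute the actual cloudelastic endpoint.
  curl -s 'https://cloudelastic.wikimedia.org:8243/_cat/nodes?v&h=name,node.role,master'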

Event Timeline


Change 974693 had a related patch set uploaded (by Bking; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/974693

Change 974693 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/974693

Change 974694 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: hosts need racking info

https://gerrit.wikimedia.org/r/974694

Change 974694 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: hosts need racking info

https://gerrit.wikimedia.org/r/974694

Change 974696 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: switch new hosts back to insetup

https://gerrit.wikimedia.org/r/974696

Change 974696 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: switch new hosts back to insetup

https://gerrit.wikimedia.org/r/974696

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1008 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1008.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • No changes in confctl are needed to restore the previous state.
  • The reimage failed, see the cookbook logs for the details
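
For context on the confctl lines above: the pooled state can be inspected and, if it ever did need restoring by hand, set again with conftool once the host is healthy. A hedged sketch; the selectors and weight are illustrative and should be checked against confctl's help output.

  # Show the current pooled state for this host across all cloudelastic services
  sudo confctl select 'name=cloudelastic1008.wikimedia.org' get
  # Repool one service once the host is healthy (weight value is illustrative)
  sudo confctl select 'name=cloudelastic1008.wikimedia.org,service=cloudelastic-chi-ssl' set/pooled=yes:weight=10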

Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1007 (FAIL)
    • Downtimed on Icinga/Alertmanager
    • Set pooled=inactive for the following services on confctl:

{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-psi-ssl-public"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-chi-ssl-public"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl"}
{"cloudelastic1007.wikimedia.org": {"weight": 0, "pooled": "inactive"}, "tags": "dc=eqiad,cluster=cloudelastic,service=cloudelastic-omega-ssl-public"}

  • Disabled Puppet
  • Removed from Puppet and PuppetDB if present and deleted any certificates
  • Removed from Debmonitor if present
  • Forced PXE for next reboot
  • Host rebooted via IPMI
  • No changes in confctl are needed to restore the previous state.
  • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye completed:

  • cloudelastic1007 (WARN)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161027_jbond_1748240_cloudelastic1007.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye completed:

  • cloudelastic1008 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311161521_bking_2651612_cloudelastic1008.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

@bking I took a look at cloudelastic1010, as I had thought it was in some broken state from the reimage cookbook. However, from the Puppet certs I can see it's been around since Nov 9 07:30:40 2023 GMT and has had Puppet disabled for the last 36 hours.

As a side note, you shouldn't need to disable Puppet when a server has the insetup role, and it's bad practice to do so.

@jbond Sorry for the confusion, I associated the reimage with the wrong ticket. The output of the last reimage is here. Puppet was disabled because the hosts were previously set to their production role, but due to the PKI errors we put them back to insetup. I should have paid more attention... it seems the reimage never actually wiped the disks, whereas I had assumed it failed on later steps.

As far as what led to this situation goes, I'll try to recount it in the hopes that it's useful:

  • DC Ops did their typical host setup for cloudelastic1008-cloudelastic1010 in this ticket. We'll ignore 1007, because I was using it to fine-tune a new partman recipe and thus it was already working.
  • I noticed 1008-1010 were not accessible via SSH. The DRAC console showed a blank screen.
  • For each host, I power-cycled it, logged in via the console with the root password, and ran Puppet. This restored SSH connectivity. However, any subsequent Puppet runs led to PKI errors.
  • I reimaged the hosts a few times after that (using the wrong ticket linked above), with the same results. Eventually I tried the --new flag for the reimage and was prompted to select a Puppet version. Selecting Puppet 7 allowed the reimages to complete successfully (see the example invocation just below this list).
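
For the record, the invocation that finally worked was along these lines. This is a best-effort recollection, not a verified command; flag names and the host argument should be double-checked against the cookbook's help output.

  # Reimage the box as a brand-new host so the cookbook prompts for a Puppet version.
  # Flags are best-effort; verify with: sudo cookbook sre.hosts.reimage --help
  sudo cookbook sre.hosts.reimage --new --os bullseye -t T351354 cloudelastic1010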

This isn't a blocker to our work, so don't feel like you have to dig in too deeply. I've left 1010 up in hopes that it might be useful. If it isn't, ping me and I'll reimage again.

@bking In order for me to investigate further, I need either a broken host to examine or a way to replicate the issue.

Change 975824 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: force Puppet 7 for cloudelastic1010

https://gerrit.wikimedia.org/r/975824

Change 975824 merged by Bking:

[operations/puppet@production] cloudelastic: force Puppet 7 for cloudelastic1010

https://gerrit.wikimedia.org/r/975824
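
A quick way to confirm which Puppet agent version a host ends up on after a change like this (the cumin query below is illustrative):

  # Check the agent version on the affected host
  sudo cumin 'cloudelastic1010.wikimedia.org' 'puppet --version'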

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:

  • cloudelastic1010 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details

Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye completed:

  • cloudelastic1010 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202311201728_bking_1282704_cloudelastic1010.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
Gehel triaged this task as High priority. Nov 22 2023, 9:24 AM
Gehel moved this task from Incoming to Ready for Work on the Data-Platform-SRE board.
bking updated Other Assignee, added: RKemper.
bking removed a subscriber: jbond.

Change 991788 had a related patch set uploaded (by Bking; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/991788

Change 991788 merged by Bking:

[operations/puppet@production] cloudelastic: bring cloudelastic10[07-10] into svc

https://gerrit.wikimedia.org/r/991788

Change 991797 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: allow new hosts to request TLS certs

https://gerrit.wikimedia.org/r/991797

Change 991797 merged by Bking:

[operations/puppet@production] cloudelastic: allow new hosts to request TLS certs

https://gerrit.wikimedia.org/r/991797

Change 991845 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: cleanup allowed_regexes

https://gerrit.wikimedia.org/r/991845

Change 991845 merged by Bking:

[operations/puppet@production] cloudelastic: cleanup allowed_regexes

https://gerrit.wikimedia.org/r/991845
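
To sanity-check that the new hosts present the expected certificate after these changes, an openssl probe along these lines works; the port and SNI name are assumptions and should be replaced with the real per-cluster TLS values.

  # Print subject, issuer and validity of the certificate served by a new host
  # (port and -servername are illustrative)
  echo | openssl s_client -connect cloudelastic1007.wikimedia.org:9243 \
      -servername cloudelastic.wikimedia.org 2>/dev/null \
      | openssl x509 -noout -subject -issuer -dates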

We got a diffscan alert as those servers are running on public IPs and new ports are exposed to the diffscan cloudVM.

After a quick look, it seems those servers already expose their endpoints through LVS (cloudelastic.wikimedia.org), so I'm wondering why they can't be in the private VLANs. If there are good reasons, could they be documented somewhere? If not, could the hosts be re-numbered to private IPs?
See https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#Public_IPs

@ayounsi Thanks for the link. We're in the process of rolling out new hosts, and unfortunately we reused the existing Puppet code without much thought about public IPs. What is the urgency of this request, and how long do you think it would take to re-IP these servers? If there are any docs on how to do this, let us know.

@taavi Indeed, I was thinking of that one too. I'll post an update there.

What is the urgency of this request?

Without sounding alarmist: if they don't need public IPs, the move should be done now, to avoid having to handle them for the next 5 years. Public-IP hosts are quite a pain, for the reasons listed on the wiki page.

and how long do you think it would take to re-IP these servers? If there are any docs on how to do this, let us know.

That's quite straightforward: outside of the reimage scripts, I'd say about 15 minutes per server.
The procedure is at https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Move_existing_server_between_rows/racks,_changing_IPs and I can walk you through it, no problem.

From the timeline and my understanding of the traffic flows and the service owner, it seems the hosts are better suited to the prod private VLAN than to cloud-private, but I'm happy to discuss it.


Likewise, I'm happy to assist here if needed. The process is a little clunky, but not too tricky if they are new servers not yet live.

It'd be a real shame to bring a bunch of new servers live on the public VLAN, using up those IPs for the next few years when we don't need to.

Change 992538 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cloudelastic: promote new hosts to master-eligible

https://gerrit.wikimedia.org/r/992538

Change 992538 merged by Bking:

[operations/puppet@production] cloudelastic: promote new hosts to master-eligible

https://gerrit.wikimedia.org/r/992538

Change 993038 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/puppet@production] cloudelastic: remove old masters

https://gerrit.wikimedia.org/r/993038

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:08:57Z] <ryankemper> T351354 Downtimed cloudelastic*; shortly will restart cloudelastic100[1,2,4] one host at a time to make them no longer masters

Change 993038 merged by Ryan Kemper:

[operations/puppet@production] cloudelastic: remove old masters

https://gerrit.wikimedia.org/r/993038

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:15:50Z] <ryankemper> T351354 Restarting cloudelastic1004 following puppet run

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:25:58Z] <ryankemper> T351354 Restarting cloudelastic1002

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:33:19Z] <ryankemper> T351354 Now restarting new masters to keep configs in sync; restarting cloudelastic1007

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:34:42Z] <ryankemper> T351354 Now restarting new masters to keep configs in sync; restarting cloudelastic1009

Mentioned in SAL (#wikimedia-operations) [2024-01-25T22:40:06Z] <ryankemper> T351354 Restarting cloudelastic1006 (final restart for today)

The old masters are no longer master-eligible. They're still participating in the cluster; we're holding off on the physical decom until T355617 is done.
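
To verify the end state, the cat APIs can confirm which node is the elected master and which nodes remain master-eligible; as above, the endpoint and port are assumptions.

  # Show the currently elected master
  curl -s 'https://cloudelastic.wikimedia.org:8243/_cat/master?v'
  # Show all nodes with their roles; only the new hosts should carry "m"
  curl -s 'https://cloudelastic.wikimedia.org:8243/_cat/nodes?v&h=name,node.role'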

cloudelastic10[07-10] are now in service (most work happened in T355617). Closing.

bking updated the task description.