Decommission old ELK5 Logstash cluster
Closed, Resolved · Public
Assigned To

Authored By

	• MoritzMuehlenhoff
	Apr 27 2021, 2:10 PM

Description

The switchover to the new cluster happened back in January. Why is the old cluster still around? Running the decom cookbooks and cleaning out Puppet references should take less than an hour, and keeping the cluster around is an ongoing maintenance burden (reboots, Java updates, etc.).

Details

Subject	Repo	Branch	Lines +/-
Remove obsolete Puppet references related to decomissioned ELK5 clusters	operations/puppet	production	+0 -28
Remove puppet leftovers of old ELK5 hosts	operations/puppet	production	+0 -52
remove elk5 related LVS services	operations/puppet	production	+0 -186
set kibana and kibana-ssl monitoring to non-critical	operations/puppet	production	+2 -2
logstash: move elk5 collectors to role::spare::system	operations/puppet	production	+2 -4
logstash: remove references to logstash202[012]	operations/puppet	production	+1 -34
logstash: add logstash200[123] to v7 cluster	operations/puppet	production	+13 -15
logstash: remove references to hosts logstash102[012]	operations/puppet	production	+0 -54
logstash: add logstash101[012] to elk7 cluster as ES backends	operations/puppet	production	+47 -1
logstash: remove elasticsearch from role::logstash	operations/puppet	production	+0 -2
logstash: remove kibana and elasticsearch from role::logstash	operations/puppet	production	+0 -4
scap: switch logstash check host to logstash1023	operations/puppet	production	+1 -1
logstash101[012]: use default OS installer version	operations/puppet	production	+0 -3
logstash101[012]: prep for reimaging	operations/puppet	production	+1 -21

Related Objects

Status	Assigned	Task
Resolved	herron	T281266 Decommission old ELK5 Logstash cluster
Resolved	• Cmjohnson	T283507 decommission logstash102[012]
Resolved	Papaul	T287496 decommission servers logstash202[012].codfw.wmnet
Resolved	herron	T297239 Move logstash api-feature-usage output away from v5 cluster
Resolved	EBernhardson	T217742 Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage
Resolved	colewhite	T176335 logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable
Resolved	dcausse	T176430 api feature logs should be sent to both eqiad and codfw clusters
Resolved	herron	T288620 Document path forward and Retire remaining non-Kafka Logstash inputs
Resolved	herron	T298794 eqiad/codfw: 2 VMs requested for apifeatureusage
Resolved	herron	T299700 Remove legacy ELK LVS entries

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 27 2021, 2:10 PM

• MoritzMuehlenhoff triaged this task as High priority. Apr 27 2021, 2:10 PM

why is the old cluster still around?

There are a few reasons. First, the ELK5 hardware hosts the Kafka brokers for kafka-logging. There is only one kafka-logging cluster per site, and both ELK versions consumed from it. After the ELK switchover in January we procured dedicated kafka-logging hardware in February. It was racked and set up in March, and about a week ago I finished migrating the eqiad brokers to the kafka-logging100[123] hardware (T279342). The codfw broker migrations are in progress now, which will shortly unblock the next thing...

Second, the ELK5 hardware was refreshed in 2019, so it's actually newer than the current ELK7 hardware. We'll reimage these hosts and merge them into the ELK7 cluster, then decom the 3 oldest hosts from the ELK7 cluster.

As for the VMs, we have deprecated all non-Kafka inputs, and the v7 cluster is configured for Kafka input only. The few non-Kafka inputs that still exist are ingested by the ELK5 VMs and output to Kafka. We could shrink the spec and number of Ganeti VMs if that's a resource issue, or possibly kill them off entirely, as the inputs have been deprecated for some time now.

lmata moved this task from Inbox to In progress on the observability board. May 3 2021, 3:36 PM

Change 685090 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash101[012]: prep for reimaging

https://gerrit.wikimedia.org/r/685090

gerritbot added a project: Patch-For-Review. May 4 2021, 8:21 PM

Change 685090 merged by Herron:

[operations/puppet@production] logstash101[012]: prep for reimaging

https://gerrit.wikimedia.org/r/685090

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105111437_herron_19044_logstash1010_eqiad_wmnet.log.
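The log file name above appears to pack several fields together. Below is a minimal Python sketch for splitting it, assuming the layout is `<YYYYMMDDHHMM>_<user>_<number>_<host with dots written as underscores>.log`. That layout is inferred from the paths quoted in this task, not from wmf-auto-reimage documentation. `parse_reimage_log` is a hypothetical helper, and the meaning of the numeric field is not stated in this task.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the log paths quoted in this task:
#   <YYYYMMDDHHMM>_<user>_<number>_<host_with_underscores>.log
LOG_NAME = re.compile(
    r"(?P<stamp>\d{12})_(?P<user>[^_]+)_(?P<number>\d+)_(?P<host>.+)\.log$"
)


def parse_reimage_log(path: str) -> dict:
    """Split a wmf-auto-reimage log path into its apparent fields."""
    name = path.rsplit("/", 1)[-1]
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised log name: {name}")
    return {
        # Time zone is unknown; the value is returned as a naive datetime.
        "started": datetime.strptime(match["stamp"], "%Y%m%d%H%M"),
        "user": match["user"],
        # Purpose of this number is not documented in the task.
        "number": int(match["number"]),
        # Assumes host names themselves never contain underscores.
        "host": match["host"].replace("_", "."),
    }


info = parse_reimage_log(
    "/var/log/wmf-auto-reimage/"
    "202105111437_herron_19044_logstash1010_eqiad_wmnet.log"
)
print(info["host"], info["started"].isoformat())
```

Grouping by the `host` field makes it easy to see that logstash1010 was reimaged twice on the same day.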

Maintenance_bot removed a project: Patch-For-Review. May 11 2021, 3:10 PM

Completed auto-reimage of hosts:

['logstash1010.eqiad.wmnet']

and were ALL successful.

Change 689166 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash101[012]: use default OS installer version

https://gerrit.wikimedia.org/r/689166

Change 689166 merged by Herron:

[operations/puppet@production] logstash101[012]: use default OS installer version

https://gerrit.wikimedia.org/r/689166

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105111707_herron_8861_logstash1010_eqiad_wmnet.log.

Maintenance_bot removed a project: Patch-For-Review. May 11 2021, 5:10 PM

Change 689189 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] scap: switch logstash check host to logstash1023

https://gerrit.wikimedia.org/r/689189

gerritbot added a project: Patch-For-Review. May 11 2021, 5:47 PM

Completed auto-reimage of hosts:

['logstash1010.eqiad.wmnet']

and were ALL successful.

Change 689189 merged by Herron:

[operations/puppet@production] scap: switch logstash check host to logstash1023

https://gerrit.wikimedia.org/r/689189

Maintenance_bot removed a project: Patch-For-Review. May 11 2021, 6:10 PM

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1011.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105111827_herron_19227_logstash1011_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['logstash1011.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1012.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105112109_herron_9184_logstash1012_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['logstash1012.eqiad.wmnet']

and were ALL successful.

Change 689977 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove kibana and elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/689977

gerritbot added a project: Patch-For-Review. May 12 2021, 5:04 PM

Change 689994 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: add logstash101[012] to elk7 cluster as ES backends

https://gerrit.wikimedia.org/r/689994

@herron I acked some alerts about ES on logstash100[7-9] not responding on port 9200. IIUC we are waiting for https://gerrit.wikimedia.org/r/689977 for the cleanup, right?

Change 689977 merged by Herron:

[operations/puppet@production] logstash: remove kibana and elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/689977

Change 690532 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/690532

Change 690532 merged by Herron:

[operations/puppet@production] logstash: remove elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/690532

Change 689994 merged by Herron:

[operations/puppet@production] logstash: add logstash101[012] to elk7 cluster as ES backends

https://gerrit.wikimedia.org/r/689994

Maintenance_bot removed a project: Patch-For-Review. May 17 2021, 9:10 PM

Change 693436 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove references to hosts logstash102[012]

https://gerrit.wikimedia.org/r/693436

gerritbot added a project: Patch-For-Review. May 21 2021, 2:48 PM

Change 693436 merged by Herron:

[operations/puppet@production] logstash: remove references to hosts logstash102[012]

https://gerrit.wikimedia.org/r/693436

herron added a subtask: T283507: decommission logstash102[012]. May 24 2021, 2:45 PM

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash1020.eqiad.wmnet

logstash1020.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found physical host
- Downtimed management interface on Icinga
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
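The cookbook report above has a regular shape: a `host (STATUS)` line followed by `- step` lines. A minimal Python sketch for turning such a report into a dictionary, useful for checking that every host passed, could look like the following. `parse_decommission_report` is a hypothetical helper written for this comment format only; it is not part of the cookbook.

```python
import re

# Matches lines such as "logstash1020.eqiad.wmnet (PASS)".
HOST_LINE = re.compile(r"^(?P<host>\S+) \((?P<status>[A-Z]+)\)$")


def parse_decommission_report(text: str) -> dict:
    """Map each host in a pasted report to its status and list of steps."""
    report = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HOST_LINE.match(line)
        if match:
            # A new host section starts here.
            current = {"status": match["status"], "steps": []}
            report[match["host"]] = current
        elif line.startswith("- ") and current is not None:
            # Step lines belong to the most recent host section.
            current["steps"].append(line[2:])
    return report


sample = """\
logstash1020.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found physical host
- Powered off
"""
parsed = parse_decommission_report(sample)
print(all(entry["status"] == "PASS" for entry in parsed.values()))
```

Lines that are neither a host header nor a step, such as the "executed by" line, are ignored.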

herron mentioned this in T283507: decommission logstash102[012]. May 24 2021, 2:48 PM

lmata moved this task from In progress to Epics In Progress on the observability board. Jun 14 2021, 3:49 PM

herron mentioned this in T234854: Upgrade ELK Stack to version 7. Jun 14 2021, 3:57 PM

Change 701611 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: add logstash200[123] to v7 cluster

https://gerrit.wikimedia.org/r/701611

lmata edited projects, added SRE Observability (FY2021/2022-Q1); removed observability. Jul 12 2021, 2:41 AM

lmata moved this task from Inbox to Epics In Progress on the SRE Observability (FY2021/2022-Q1) board.

Change 701611 merged by Herron:

[operations/puppet@production] logstash: add logstash200[123] to v7 cluster

https://gerrit.wikimedia.org/r/701611

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211549_herron_15145_logstash2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['logstash2001.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211643_herron_31177_logstash2002_codfw_wmnet.log.

Completed auto-reimage of hosts:

['logstash2002.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211747_herron_23546_logstash2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['logstash2003.codfw.wmnet']

and were ALL successful.

Change 708311 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove references to logstash202[012]

https://gerrit.wikimedia.org/r/708311

Change 708311 merged by Herron:

[operations/puppet@production] logstash: remove references to logstash202[012]

https://gerrit.wikimedia.org/r/708311

herron added a subtask: T287496: decommission servers logstash202[012].codfw.wmnet. Jul 27 2021, 5:30 PM

All elk5 hardware has been decommed at this point.

We have 3 Ganeti VMs per site remaining, which are needed to handle the legacy logstash inputs. We're working to retire these too. For the purposes of decomming hardware and reducing the elk5 surface area for updates and reboots, though, I think we are good here and can move over to T227080 to track the retirement of the remaining legacy inputs.

Papaul closed subtask T287496: decommission servers logstash202[012].codfw.wmnet as Resolved. Aug 2 2021, 3:47 PM

• Cmjohnson closed subtask T283507: decommission logstash102[012] as Resolved. Aug 25 2021, 6:05 PM

lmata moved this task from Epics In Progress to Done on the SRE Observability (FY2021/2022-Q1) board. Sep 14 2021, 9:36 PM

Reopening this, as progress has been made retiring legacy log inputs and we're now ready to move on to decom of the Ganeti VMs.

In prep for shutting down the Ganeti VMs we will need to relocate the api-feature-usage outputs to another logstash cluster. Linking a task for that step.

herron moved this task from FY2021/2022-Q1 to FY2021/2022-Q3 on the SRE Observability board. Jan 11 2022, 7:22 PM

herron edited projects, added SRE Observability (FY2021/2022-Q3); removed SRE Observability (FY2021/2022-Q1).

Change 755467 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

Change 755467 merged by Herron:

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

Change 755477 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] set kibana and kibana-ssl monitoring to non-critical

https://gerrit.wikimedia.org/r/755477

Change 755477 merged by Herron:

[operations/puppet@production] set kibana and kibana-ssl monitoring to non-critical

https://gerrit.wikimedia.org/r/755477

Change 755480 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] remove elk5 related LVS services

https://gerrit.wikimedia.org/r/755480

herron closed subtask T297239: Move logstash api-feature-usage output away from v5 cluster as Resolved. Jan 19 2022, 8:57 PM

herron mentioned this in T297239: Move logstash api-feature-usage output away from v5 cluster.

In T281266#7634280, @gerritbot wrote:

Change 755480 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] remove elk5 related LVS services

https://gerrit.wikimedia.org/r/755480

There are Icinga alerts open relating to this which I have acked. It's not ideal for those to be open for any length of time, but I'm erring on the side of caution and waiting for a proper review before moving forward, to avoid additional false positives or worse.

Change 755480 merged by Herron:

[operations/puppet@production] remove elk5 related LVS services

https://gerrit.wikimedia.org/r/755480

herron closed subtask T299700: Remove legacy ELK LVS entries as Resolved. Jan 21 2022, 8:21 PM

In T281266#7634510, @herron wrote:

In T281266#7634280, @gerritbot wrote:

Change 755480 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] remove elk5 related LVS services

https://gerrit.wikimedia.org/r/755480

There are Icinga alerts open relating to this which I have acked. It's not ideal for those to be open for any length of time, but I'm erring on the side of caution and waiting for a proper review before moving forward, to avoid additional false positives or worse.

This was completed a few days ago; please see T299700 for more detail.

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash[1007-1009].eqiad.wmnet

logstash1007.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

logstash1008.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

logstash1009.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash[2004-2006].codfw.wmnet

logstash2004.codfw.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

logstash2005.codfw.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox

logstash2006.codfw.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
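Host groups in this task are written in two shorthand styles: a character class such as `logstash101[012]`, and a numeric range such as `logstash[1007-1009].eqiad.wmnet`. As a rough illustration of how both expand to individual host names, here is a minimal Python sketch. `expand_hosts` is a hypothetical helper, not the parser used by cumin or the cookbooks, and it only handles a single bracket group per pattern.

```python
import re

# First bracketed group in a host pattern, e.g. "[012]" or "[1007-1009]".
BRACKET = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern: str) -> list[str]:
    """Expand one bracket group into the individual host names it denotes."""
    match = BRACKET.search(pattern)
    if match is None:
        return [pattern]
    body = match.group(1)
    if "-" in body:
        # Numeric range such as "1007-1009" or "7-9"; keep zero padding.
        start, end = body.split("-", 1)
        width = len(start)
        choices = [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
    else:
        # Character class such as "012": each character is one alternative.
        choices = list(body)
    prefix = pattern[: match.start()]
    suffix = pattern[match.end():]
    return [prefix + choice + suffix for choice in choices]


for name in expand_hosts("logstash[2004-2006].codfw.wmnet"):
    print(name)
```

Treating a dash inside the brackets as a range and anything else as a character class is a simplification chosen to cover the patterns that appear in this task.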