Decommission old ELK5 Logstash cluster
Closed, Resolved (Public)

Description

The switchover to the new cluster happened back in January, so why is the old cluster still around? Running the decom cookbooks and cleaning out Puppet references should take less than an hour, and keeping the cluster around is an ongoing maintenance burden (reboots, Java updates, etc.).

Event Timeline

why is the old cluster still around?

There are a few reasons. First, the ELK5 hardware hosts the Kafka brokers for kafka-logging. There is only one kafka-logging cluster per site, and both ELK versions consumed from it. After the ELK switchover in Jan we procured dedicated kafka-logging hardware in Feb. That was racked and set up in Mar, and about a week ago I finished migrating the eqiad brokers to the kafka-logging100[123] hardware (T279342). The codfw broker migrations are in progress now, which will shortly unblock the next thing...

Second, the ELK5 hardware was refreshed in 2019, so it's actually newer than the current ELK7 hardware. We'll reimage and merge these hosts into the ELK7 cluster, then decom the 3 oldest hosts from the ELK7 cluster.

As for the VMs, we have deprecated all non-kafka inputs and the v7 cluster is configured for kafka input only. The few non-kafka inputs that still exist are ingested by the ELK5 VMs and output to kafka. We could shrink the spec and number of Ganeti VMs if that's a resource issue, or possibly kill them off entirely, as the inputs have been deprecated for some time now.
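
The data flow described above, where legacy inputs are accepted by the remaining collectors and re-emitted to Kafka so that the v7 cluster only ever reads from Kafka, can be sketched roughly as follows. This is only an illustration in Python: the real collectors are Logstash pipelines, and the topic name, the field names, and the `send` callable are all hypothetical.

```python
import json
from typing import Callable

# Hypothetical topic name, used only for this illustration.
LEGACY_TOPIC = "logging-legacy"


def forward_legacy_event(
    raw_line: str,
    source: str,
    send: Callable[[str, bytes], None],
) -> None:
    """Wrap one line from a legacy (non-Kafka) input and hand it to Kafka.

    `send(topic, payload)` stands in for a real Kafka producer call, so the
    sketch stays self-contained and does not depend on a client library.
    """
    event = {
        "message": raw_line.rstrip("\n"),
        "legacy_input": source,  # hypothetical field recording the origin
    }
    send(LEGACY_TOPIC, json.dumps(event).encode("utf-8"))


if __name__ == "__main__":
    sent = []  # collects (topic, payload) pairs instead of talking to Kafka
    forward_legacy_event("hello\n", "udp-syslog", lambda t, p: sent.append((t, p)))
    print(sent)
```

With this shape the downstream cluster needs only a Kafka input, which is why the legacy collectors can be retired as soon as the last non-Kafka producer is gone.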

Change 685090 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash101[012]: prep for reimaging

https://gerrit.wikimedia.org/r/685090

Change 685090 merged by Herron:

[operations/puppet@production] logstash101[012]: prep for reimaging

https://gerrit.wikimedia.org/r/685090

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105111437_herron_19044_logstash1010_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['logstash1010.eqiad.wmnet']

and were ALL successful.

Change 689166 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash101[012]: use default OS installer version

https://gerrit.wikimedia.org/r/689166

Change 689166 merged by Herron:

[operations/puppet@production] logstash101[012]: use default OS installer version

https://gerrit.wikimedia.org/r/689166

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1010.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105111707_herron_8861_logstash1010_eqiad_wmnet.log.

Change 689189 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] scap: switch logstash check host to logstash1023

https://gerrit.wikimedia.org/r/689189

Completed auto-reimage of hosts:

['logstash1010.eqiad.wmnet']

and were ALL successful.

Change 689189 merged by Herron:

[operations/puppet@production] scap: switch logstash check host to logstash1023

https://gerrit.wikimedia.org/r/689189

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1011.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105111827_herron_19227_logstash1011_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['logstash1011.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash1012.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202105112109_herron_9184_logstash1012_eqiad_wmnet.log.

Completed auto-reimage of hosts:

['logstash1012.eqiad.wmnet']

and were ALL successful.

Change 689977 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove kibana and elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/689977

Change 689994 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: add logstash101[012] to elk7 cluster as ES backends

https://gerrit.wikimedia.org/r/689994

@herron I acked some alerts about ES on logstash100[7-9] not responding on port 9200. IIUC we are waiting for https://gerrit.wikimedia.org/r/689977 for the cleanup, right?

Change 689977 merged by Herron:

[operations/puppet@production] logstash: remove kibana and elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/689977

Change 690532 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/690532

Change 690532 merged by Herron:

[operations/puppet@production] logstash: remove elasticsearch from role::logstash

https://gerrit.wikimedia.org/r/690532

Change 689994 merged by Herron:

[operations/puppet@production] logstash: add logstash101[012] to elk7 cluster as ES backends

https://gerrit.wikimedia.org/r/689994

Change 693436 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove references to hosts logstash102[012]

https://gerrit.wikimedia.org/r/693436

Change 693436 merged by Herron:

[operations/puppet@production] logstash: remove references to hosts logstash102[012]

https://gerrit.wikimedia.org/r/693436

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash1020.eqiad.wmnet

  • logstash1020.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found physical host
    • Downtimed management interface on Icinga
    • Wiped all swraid, partition-table and filesystem signatures
    • Powered off
    • Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB

Change 701611 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: add logstash200[123] to v7 cluster

https://gerrit.wikimedia.org/r/701611

Change 701611 merged by Herron:

[operations/puppet@production] logstash: add logstash200[123] to v7 cluster

https://gerrit.wikimedia.org/r/701611

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash2001.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211549_herron_15145_logstash2001_codfw_wmnet.log.

Completed auto-reimage of hosts:

['logstash2001.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash2002.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211643_herron_31177_logstash2002_codfw_wmnet.log.

Completed auto-reimage of hosts:

['logstash2002.codfw.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:

logstash2003.codfw.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107211747_herron_23546_logstash2003_codfw_wmnet.log.

Completed auto-reimage of hosts:

['logstash2003.codfw.wmnet']

and were ALL successful.

Change 708311 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: remove references to logstash202[012]

https://gerrit.wikimedia.org/r/708311

Change 708311 merged by Herron:

[operations/puppet@production] logstash: remove references to logstash202[012]

https://gerrit.wikimedia.org/r/708311

herron claimed this task.

All elk5 hardware has been decommed at this point.

We have 3 Ganeti VMs per site remaining, which are needed to handle the legacy logstash inputs. We're working to retire these too, but for the purposes of decomming hardware and reducing the elk5 surface area for updates and reboots I think we are good here, and we can move over to T227080 to track the retirement of the remaining legacy inputs.

Reopening this as progress has been made retiring legacy log inputs and now we're ready to move on to decom of the Ganeti VMs.

In prep for shutting down the Ganeti VMs we will need to relocate the api-feature-usage outputs to another logstash cluster. Linking a task for that step.

Change 755467 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

Change 755467 merged by Herron:

[operations/puppet@production] logstash: move elk5 collectors to role::spare::system

https://gerrit.wikimedia.org/r/755467

Change 755477 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] set kibana and kibana-ssl monitoring to non-critical

https://gerrit.wikimedia.org/r/755477

Change 755477 merged by Herron:

[operations/puppet@production] set kibana and kibana-ssl monitoring to non-critical

https://gerrit.wikimedia.org/r/755477

Change 755480 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] remove elk5 related LVS services

https://gerrit.wikimedia.org/r/755480

There are Icinga alerts open relating to this, which I have acked. It's not ideal for those to stay open for any length of time, but I'm erring on the side of caution and waiting for a proper review before moving forward, to avoid additional false positives or worse.

Change 755480 merged by Herron:

[operations/puppet@production] remove elk5 related LVS services

https://gerrit.wikimedia.org/r/755480

This was completed a few days ago; please see T299700 for more detail.

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash[1007-1009].eqiad.wmnet

  • logstash1007.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • logstash1008.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
  • logstash1009.eqiad.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash[2004-2006].codfw.wmnet

  • logstash2004.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
  • logstash2005.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
  • logstash2006.codfw.wmnet (PASS)
    • Downtimed host on Icinga
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
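
For reference, the bracketed host expressions used throughout this task (`logstash101[012]`, `logstash100[7-9]`, `logstash[2004-2006].codfw.wmnet`) can be expanded with a small helper like the one below. This is a minimal sketch that assumes the informal reading used in the commit titles: a bare `[012]` means one character per host, and a hyphen means an inclusive numeric range. It is not the parser that Cumin or the cookbooks actually use, and `expand_hosts` is a hypothetical name.

```python
import re
from itertools import product

_BRACKET = re.compile(r"\[([^\]]+)\]")


def _expand_group(body: str) -> list[str]:
    """Expand the inside of one bracket group into its alternatives."""
    if "-" not in body and "," not in body:
        # Glob-style character class: "012" -> "0", "1", "2".
        return list(body)
    values = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            width = len(start)  # preserve zero padding, e.g. "07-09"
            values.extend(
                str(n).zfill(width) for n in range(int(start), int(end) + 1)
            )
        else:
            values.append(part)
    return values


def expand_hosts(pattern: str) -> list[str]:
    """Expand a bracketed host expression into individual host names."""
    # re.split with a capture group alternates: literal, body, literal, ...
    pieces = _BRACKET.split(pattern)
    literals = pieces[0::2]
    groups = [_expand_group(body) for body in pieces[1::2]]
    hosts = []
    for combo in product(*groups):
        name = literals[0]
        for value, literal in zip(combo, literals[1:]):
            name += value + literal
        hosts.append(name)
    return hosts


if __name__ == "__main__":
    for host in expand_hosts("logstash[2004-2006].codfw.wmnet"):
        print(host)
```

A pattern without brackets is returned unchanged as a single-element list, so the helper can be applied to plain host names as well.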

Change 856521 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove puppet leftovers of old ELK5 hosts

https://gerrit.wikimedia.org/r/856521

Change 856521 merged by Herron:

[operations/puppet@production] Remove puppet leftovers of old ELK5 hosts

https://gerrit.wikimedia.org/r/856521

Change 856934 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove obsolete Puppet references related to decomissioned ELK5 clusters

https://gerrit.wikimedia.org/r/856934

Change 856934 merged by Herron:

[operations/puppet@production] Remove obsolete Puppet references related to decomissioned ELK5 clusters

https://gerrit.wikimedia.org/r/856934