The switchover to the new cluster happened back in January, so why is the old cluster still around? Running the decom cookbooks and cleaning out Puppet references should take less than an hour, and keeping the old cluster around is an ongoing maintenance burden (reboots, Java updates, etc.).
Description
Details
Status | Assigned | Task
---|---|---
Resolved | herron | T281266 Decommission old ELK5 Logstash cluster
Resolved | Cmjohnson | T283507 decommission logstash102[012]
Resolved | Papaul | T287496 decommission servers logstash202[012].codfw.wmnet
Resolved | herron | T297239 Move logstash api-feature-usage output away from v5 cluster
Resolved | EBernhardson | T217742 Rework the data flow between logstash and cirrus elasticsearch cluster for ApiFeatureUsage
Resolved | colewhite | T176335 logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable
Resolved | dcausse | T176430 api feature logs should be sent to both eqiad and codfw clusters
Resolved | herron | T288620 Document path forward and Retire remaining non-Kafka Logstash inputs
Resolved | herron | T298794 eqiad/codfw: 2 VMs requested for apifeatureusage
Resolved | herron | T299700 Remove legacy ELK LVS entries
Event Timeline
why is the old cluster still around?
There are a few reasons. First, the ELK5 hardware hosts the Kafka brokers for kafka-logging. There is only one kafka-logging cluster per site, and both ELK versions consumed from it. After the ELK switchover in Jan we procured dedicated kafka-logging hardware in Feb. That was racked and set up in Mar, and about a week ago I finished migrating the eqiad brokers to the kafka-logging100[123] hardware (T279342). The codfw broker migrations are in progress now, which will shortly unblock the next thing...
Second, the ELK5 hardware was refreshed in 2019, so it's actually newer than the current ELK7 hardware. We'll reimage and merge these hosts into the ELK7 cluster, then decom the 3 oldest hosts from the ELK7 cluster.
As for the VMs, we have deprecated all non-kafka inputs and the v7 cluster is configured for kafka input only. The few non-kafka inputs that still exist are ingested by the ELK5 VMs and output to kafka. We could shrink the spec and number of ganeti VMs if that's a resource issue. Or possibly kill them off entirely, as the inputs have been deprecated for some time now.
Change 685090 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash101[012]: prep for reimaging
Change 685090 merged by Herron:
[operations/puppet@production] logstash101[012]: prep for reimaging
Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:
logstash1010.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202105111437_herron_19044_logstash1010_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['logstash1010.eqiad.wmnet']
and were ALL successful.
Change 689166 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash101[012]: use default OS installer version
Change 689166 merged by Herron:
[operations/puppet@production] logstash101[012]: use default OS installer version
Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:
logstash1010.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202105111707_herron_8861_logstash1010_eqiad_wmnet.log.
Change 689189 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] scap: switch logstash check host to logstash1023
Completed auto-reimage of hosts:
['logstash1010.eqiad.wmnet']
and were ALL successful.
Change 689189 merged by Herron:
[operations/puppet@production] scap: switch logstash check host to logstash1023
Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:
logstash1011.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202105111827_herron_19227_logstash1011_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['logstash1011.eqiad.wmnet']
and were ALL successful.
Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:
logstash1012.eqiad.wmnet
The log can be found in /var/log/wmf-auto-reimage/202105112109_herron_9184_logstash1012_eqiad_wmnet.log.
Completed auto-reimage of hosts:
['logstash1012.eqiad.wmnet']
and were ALL successful.
Change 689977 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: remove kibana and elasticsearch from role::logstash
Change 689994 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: add logstash101[012] to elk7 cluster as ES backends
@herron I acked some alerts about logstash100[7-9]'s ES on port 9200 not responding. IIUC we are waiting for https://gerrit.wikimedia.org/r/689977 for the cleanup, right?
Change 689977 merged by Herron:
[operations/puppet@production] logstash: remove kibana and elasticsearch from role::logstash
Change 690532 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: remove elasticsearch from role::logstash
Change 690532 merged by Herron:
[operations/puppet@production] logstash: remove elasticsearch from role::logstash
Change 689994 merged by Herron:
[operations/puppet@production] logstash: add logstash101[012] to elk7 cluster as ES backends
Change 693436 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: remove references to hosts logstash102[012]
Change 693436 merged by Herron:
[operations/puppet@production] logstash: remove references to hosts logstash102[012]
cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash1020.eqiad.wmnet
- logstash1020.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found physical host
- Downtimed management interface on Icinga
- Wiped all swraid, partition-table and filesystem signatures
- Powered off
- Set Netbox status to Decommissioning and deleted all non-mgmt interfaces and related IPs
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
Change 701611 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: add logstash200[123] to v7 cluster
Change 701611 merged by Herron:
[operations/puppet@production] logstash: add logstash200[123] to v7 cluster
Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:
logstash2001.codfw.wmnet
The log can be found in /var/log/wmf-auto-reimage/202107211549_herron_15145_logstash2001_codfw_wmnet.log.
Completed auto-reimage of hosts:
['logstash2001.codfw.wmnet']
and were ALL successful.
Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:
logstash2002.codfw.wmnet
The log can be found in /var/log/wmf-auto-reimage/202107211643_herron_31177_logstash2002_codfw_wmnet.log.
Completed auto-reimage of hosts:
['logstash2002.codfw.wmnet']
and were ALL successful.
Script wmf-auto-reimage was launched by herron on cumin1001.eqiad.wmnet for hosts:
logstash2003.codfw.wmnet
The log can be found in /var/log/wmf-auto-reimage/202107211747_herron_23546_logstash2003_codfw_wmnet.log.
Completed auto-reimage of hosts:
['logstash2003.codfw.wmnet']
and were ALL successful.
Change 708311 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: remove references to logstash202[012]
Change 708311 merged by Herron:
[operations/puppet@production] logstash: remove references to logstash202[012]
All elk5 hardware has been decommed at this point.
We have 3 Ganeti VMs per site remaining, which are needed to handle the legacy logstash inputs. We're working to retire these too, but for the purposes of decomming hardware and reducing the elk5 surface area for updates and reboots I think we are good here and can move over to T227080 to track the retirement of the remaining legacy inputs.
Reopening this as progress has been made retiring legacy log inputs and now we're ready to move on to decom of the Ganeti VMs.
In prep for shutting down the Ganeti VMs we will need to relocate the api-feature-usage outputs to another logstash cluster; I'm linking a task for that step.
Change 755467 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: move elk5 collectors to role::spare::system
Change 755467 merged by Herron:
[operations/puppet@production] logstash: move elk5 collectors to role::spare::system
Change 755477 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] set kibana and kibana-ssl monitoring to non-critical
Change 755477 merged by Herron:
[operations/puppet@production] set kibana and kibana-ssl monitoring to non-critical
Change 755480 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] remove elk5 related LVS services
There are Icinga alerts open relating to this, which I have acked. It's not ideal for those to be open for any length of time, but I'm erring on the side of caution and waiting for a proper review before moving forward, to avoid additional false positives or worse.
Change 755480 merged by Herron:
[operations/puppet@production] remove elk5 related LVS services
cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash[1007-1009].eqiad.wmnet
- logstash1007.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- logstash1008.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- logstash1009.eqiad.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqiad.wmnet to Netbox
cookbooks.sre.hosts.decommission executed by herron@cumin1001 for hosts: logstash[2004-2006].codfw.wmnet
- logstash2004.codfw.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- logstash2005.codfw.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- logstash2006.codfw.wmnet (PASS)
- Downtimed host on Icinga
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster ganeti01.svc.codfw.wmnet to Netbox
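For readers unfamiliar with the bracketed host notation used throughout this task (for example `logstash[2004-2006].codfw.wmnet` in the cookbook runs above, or `logstash101[012]` in the patch titles), here is a minimal Python sketch of how such patterns expand into individual hostnames. The `expand_hosts` helper is hypothetical and written only for illustration; it is not the cookbook's or cumin's own parser. It assumes a group containing `-` is an inclusive numeric range and any other group is a set of single-character alternatives, which matches how the patterns are used in this thread. Comma-separated lists are not handled.

```python
import re
from itertools import product
from typing import List

# Matches one bracket group such as "[1007-1009]" or "[012]".
_GROUP = re.compile(r"\[([^\]]+)\]")


def expand_hosts(pattern: str) -> List[str]:
    """Expand bracket groups in a host pattern into individual hostnames.

    Hypothetical helper: "[A-B]" is read as an inclusive numeric range and
    "[xyz]" as single-character alternatives (an assumption based on how
    the notation is used in this task).
    """
    # With a capturing group, re.split alternates literal text and group bodies.
    parts = _GROUP.split(pattern)
    choices = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            # Literal text between groups: exactly one choice.
            choices.append([part])
        elif "-" in part:
            start, end = part.split("-", 1)
            width = len(start)  # preserve zero padding, if any
            choices.append(
                [str(n).zfill(width) for n in range(int(start), int(end) + 1)]
            )
        else:
            # Character-class style: each character is one alternative.
            choices.append(list(part))
    return ["".join(combo) for combo in product(*choices)]


if __name__ == "__main__":
    for host in expand_hosts("logstash[2004-2006].codfw.wmnet"):
        print(host)
    for host in expand_hosts("logstash101[012]"):
        print(host)
```

Under these assumptions, the first pattern yields the three codfw VMs decommissioned above and the second yields logstash1010 through logstash1012.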
Change 856521 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Remove puppet leftovers of old ELK5 hosts
Change 856521 merged by Herron:
[operations/puppet@production] Remove puppet leftovers of old ELK5 hosts
Change 856934 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Remove obsolete Puppet references related to decomissioned ELK5 clusters
Change 856934 merged by Herron:
[operations/puppet@production] Remove obsolete Puppet references related to decomissioned ELK5 clusters