
Upgrade ELK Stack
Open, Medium, Public

Description

Tracking task for upgrading the ELK stack to a more current stable release (targeting version 7.2)

High level items

  1. Build an ELK 7 upgrade environment in parallel to production
    • Provision ES 7 hosts (HW & OS)
    • Provision Logstash/Kibana 7 collector hosts (VM & OS)
    • Make new versions of ELK software installable via apt
    • Puppetize logging ES 7
    • Puppetize Logstash 7
    • Puppetize Kibana 7
    • Configure service address for load balanced Kibana frontend
  2. Determine legal viability of Amazon Open Distro for Elasticsearch; if viable:
    [ ] Integrate RBAC features with LDAP
    [ ] Puppetize management of security users, roles, mappings, etc.
  3. Ingest production logs
    • Determine best way to handle/manage logstash plugins in the new version & execute
    • Consume from kafka-logging
    • Determine best method to bridge the gap for ingesting log sources not yet in Kafka
    • Validate log parsing, storage, etc.
    • Investigate and upgrade/adapt curator as necessary
    • Import Kibana configuration (saved searches, dashboards, visualizations, etc.)
  4. Determine whether alerting features should be enabled; if so:
    [ ] Document guidelines for alerting functionality
  5. Overall validation and cut-over
    • Provide wide access to the new environment, with the old environment still available as a backup (https://logstash-next.wikimedia.org)
      • Gather and address bugs identified during this period (to be expanded as we gain better understanding/experience here)
    • Perform cut-over (name switch to logstash.wm.o)
  6. Migrate Kafka-logging brokers to ELK 7 cluster
  7. Fold (reimage/migrate) ELK 5 hardware into ELK 7 cluster
  8. Retire ELK 5 VMs

Details

Project            Branch      Lines +/-   Subject
operations/puppet  production  +2 -2
operations/puppet  production  +14 -1
operations/puppet  production  +0 -12
operations/puppet  production  +11 -0
operations/puppet  production  +1 -1
operations/puppet  production  +115 -7
operations/puppet  production  +10 -0
operations/dns     master      +1 -0
operations/puppet  production  +24 -22
operations/puppet  production  +1 -1
operations/puppet  production  +51 -0
operations/puppet  production  +8 -0
operations/puppet  production  +13 -0
operations/puppet  production  +4 -2
operations/puppet  production  +78 -0
operations/dns     master      +12 -0
operations/puppet  production  +57 -0
operations/puppet  production  +4 -2
operations/puppet  production  +86 -74
operations/puppet  production  +0 -79
operations/puppet  production  +517 -2
operations/puppet  production  +1 -0
operations/puppet  production  +8 -3
operations/puppet  production  +3 -0
operations/puppet  production  +147 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +27 -1
operations/puppet  production  +85 -0
operations/dns     master      +12 -0
operations/puppet  production  +525 -19
operations/puppet  production  +5 -5
operations/puppet  production  +6 -0
operations/puppet  production  +12 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 552881 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: create elk7 logstash role and assign to elk7 collectors

https://gerrit.wikimedia.org/r/552881

Change 552837 merged by Herron:
[operations/puppet@production] logstash: create elk7 ES role and assign to elk7 ES hw hosts

https://gerrit.wikimedia.org/r/552837

Change 554095 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add buster openjdk 8 repository

https://gerrit.wikimedia.org/r/554095

Change 554095 merged by Herron:
[operations/puppet@production] elasticsearch: add buster openjdk 8 repository

https://gerrit.wikimedia.org/r/554095

Change 554101 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set elk7 es config_version to 7

https://gerrit.wikimedia.org/r/554101

Change 554101 merged by Herron:
[operations/puppet@production] logstash: set elk7 es config_version to 7

https://gerrit.wikimedia.org/r/554101

Change 554103 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set elk7 es heap_memory to 24G

https://gerrit.wikimedia.org/r/554103

Change 554103 merged by Herron:
[operations/puppet@production] logstash: set elk7 es heap_memory to 24G

https://gerrit.wikimedia.org/r/554103

Change 552881 merged by Herron:
[operations/puppet@production] logstash: create elk7 logstash role and assign to elk7 collectors

https://gerrit.wikimedia.org/r/552881

Change 554152 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add logstash_package param and set elk7 to -oss variant

https://gerrit.wikimedia.org/r/554152

Change 554152 merged by Herron:
[operations/puppet@production] logstash: add logstash_package param and set elk7 to -oss variant

https://gerrit.wikimedia.org/r/554152

Change 554157 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kibana: add kibana_package param and set elk7 hosts to -oss variant

https://gerrit.wikimedia.org/r/554157

Change 554157 merged by Herron:
[operations/puppet@production] kibana: add kibana_package param and set elk7 hosts to -oss variant

https://gerrit.wikimedia.org/r/554157

Change 554160 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set es config_version to 7 on elk7 hosts

https://gerrit.wikimedia.org/r/554160

Change 554160 merged by Herron:
[operations/puppet@production] logstash: set es config_version to 7 on elk7 hosts

https://gerrit.wikimedia.org/r/554160

elukey added a subscriber: elukey. Dec 3 2019, 8:06 AM

Hello! I took the liberty to ack a lot of criticals/unknowns in icinga that were related to these new hosts, IIUC these are not in production :)


Thanks! I've downtimed the new hosts and their services until Thursday.

Change 554314 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: create elk7 logstash collector profile

https://gerrit.wikimedia.org/r/554314

Change 554314 merged by Herron:
[operations/puppet@production] logstash: create elk7 logstash collector profile

https://gerrit.wikimedia.org/r/554314

Change 554355 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: remove non-kafka inputs from elk7 cluster

https://gerrit.wikimedia.org/r/554355

Change 554355 merged by Herron:
[operations/puppet@production] logstash: remove non-kafka inputs from elk7 cluster

https://gerrit.wikimedia.org/r/554355

Change 554362 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add kafka ssl_endpoint_identification_algorithm param

https://gerrit.wikimedia.org/r/554362

Change 554362 merged by Herron:
[operations/puppet@production] logstash: add kafka ssl_endpoint_identification_algorithm param

https://gerrit.wikimedia.org/r/554362

Change 554472 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: set kafka consumer groups at the role level

https://gerrit.wikimedia.org/r/554472

Change 554472 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: set kafka consumer groups at the role level

https://gerrit.wikimedia.org/r/554472

herron updated the task description. Dec 4 2019, 3:35 PM

Mentioned in SAL (#wikimedia-operations) [2019-12-05T08:03:52Z] <elukey> remove logstash_cleanup_indices_apifeatureusage-search.svc.codfw.wmnet and logstash_cleanup_indices_apifeatureusage-search.svc.eqiad.wmnet from logstash1025,logstash1024,logstash1023,logstash2024,logstash2025 to reduce cronspam - T234854

elukey added a comment. Dec 5 2019, 8:08 AM

Hello :)

From cronspam I can see two errors that happen daily:

Cron[logstash_cleanup_indices_logstash]

elasticsearch.exceptions.ElasticsearchException: Unable to create client connection to Elasticsearch.  Error: Elasticsearch version 7.4.2 incompatible with this version of Curator (5.2.0)

Cron[logstash_cleanup_indices_apifeatureusage-search.svc.codfw.wmnet]
Cron[logstash_cleanup_indices_apifeatureusage-search.svc.eqiad.wmnet]

Error: Invalid value for "--config": Path "/etc/curator/config-apifeatureusage-search.svc.codfw.wmnet.yaml" does not exist.

The latter seems fixed: I removed the crontab entries (since I didn't find anything in puppet related to the hosts sending cronspam) and ran puppet to make sure that those crons were not in the catalog. The former seems to be something to check/review :)
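
When triaging cronspam like the first error above, it can help to pull the version pair out of the Curator message mechanically. The following is a minimal sketch, assuming the message wording matches the line quoted above (other Curator releases may phrase it differently). `parse_curator_incompatibility` is a hypothetical helper name and is not part of Curator.

```python
import re

# Pattern modeled on the single error line quoted in this task; this wording
# is an assumption and may differ between Curator releases.
_INCOMPATIBLE = re.compile(
    r"Elasticsearch version (?P<es>\d+(?:\.\d+)*) incompatible with "
    r"this version of Curator \((?P<curator>\d+(?:\.\d+)*)\)"
)


def parse_curator_incompatibility(line):
    """Return (es_version, curator_version) as strings, or None if no match."""
    match = _INCOMPATIBLE.search(line)
    if match is None:
        return None
    return match.group("es"), match.group("curator")
```

Lines that do not contain the incompatibility message, such as the missing `--config` path error, return `None` and can be handled separately.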

Change 554905 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] lvs: add entries for logstash-next and kibana-next

https://gerrit.wikimedia.org/r/554905

Change 554906 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add kibana-next and logstash-next service addresses

https://gerrit.wikimedia.org/r/554906

Change 554905 merged by Herron:
[operations/puppet@production] lvs: add entries for logstash-next and kibana-next

https://gerrit.wikimedia.org/r/554905

Change 554906 merged by Herron:
[operations/dns@master] dns: add kibana-next and logstash-next service addresses

https://gerrit.wikimedia.org/r/554906

@herron hello :) Any comment on what I wrote above about cronspam?

@elukey hey, yes that's been fixed by making a newer version of curator available to the new clusters. Haven't seen cron errors from these since Dec 5. Thanks for cleaning up the "config does not exist" entries!

Change 571548 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: output logs ingested by deprecated inputs to kafka-logging

https://gerrit.wikimedia.org/r/571548

Change 571554 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash::collector7 ingest deprecated logs from kafka

https://gerrit.wikimedia.org/r/571554

Change 571622 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add ES 7 compatible logstash template

https://gerrit.wikimedia.org/r/571622

Change 571813 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: remove defalut value from kafka input type field

https://gerrit.wikimedia.org/r/571813

Change 571548 merged by Herron:
[operations/puppet@production] logstash: output logs ingested by deprecated inputs to kafka-logging

https://gerrit.wikimedia.org/r/571548

Change 571813 merged by Herron:
[operations/puppet@production] logstash: remove defalut value from kafka input type field

https://gerrit.wikimedia.org/r/571813

Change 571554 merged by Herron:
[operations/puppet@production] logstash::collector7 ingest deprecated logs from kafka

https://gerrit.wikimedia.org/r/571554

Change 574862 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add load balancing for kibana-next

https://gerrit.wikimedia.org/r/574862

Change 575320 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add profile::idp::client::httpd hiera for elk7 env

https://gerrit.wikimedia.org/r/575320

Change 575320 merged by Herron:
[operations/puppet@production] add profile::idp::client::httpd hiera for elk7 env

https://gerrit.wikimedia.org/r/575320

Change 574862 merged by Herron:
[operations/puppet@production] add load balancing for kibana-next

https://gerrit.wikimedia.org/r/574862

Change 575631 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] lvs: kibana-next: promote from "service_setup" to "lvs_setup"

https://gerrit.wikimedia.org/r/575631

Change 575631 merged by Herron:
[operations/puppet@production] lvs: kibana-next: promote from "service_setup" to "lvs_setup"

https://gerrit.wikimedia.org/r/575631

Change 576152 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add logstash-next.wikimedia.org record

https://gerrit.wikimedia.org/r/576152

Change 576151 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] cache: map logstash-next.wikimedia.org to kibana-next lvs

https://gerrit.wikimedia.org/r/576151

Change 576411 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add kibana-next SANs to kibana cert

https://gerrit.wikimedia.org/r/576411

Change 576411 merged by Herron:
[operations/puppet@production] add kibana-next SANs to kibana cert

https://gerrit.wikimedia.org/r/576411

Change 576152 abandoned by Herron:
dns: add logstash-next.wikimedia.org record

Reason:
abandoning in favor of a5257d4fc7826c26a6a7e60799b1c71fc789ed65

https://gerrit.wikimedia.org/r/576152

Change 576151 merged by Herron:
[operations/puppet@production] cache: map logstash-next.wikimedia.org and cas-logstash to kibana-next lvs

https://gerrit.wikimedia.org/r/576151

Change 576967 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add max_clause_count setting

https://gerrit.wikimedia.org/r/576967

herron updated the task description. Mar 5 2020, 5:37 PM

Change 571622 merged by Herron:
[operations/puppet@production] logstash: add ES 7 compatible logstash template

https://gerrit.wikimedia.org/r/571622

Change 579461 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] assign codfw logstash ssd hosts role::insetup

https://gerrit.wikimedia.org/r/579461

Change 579461 merged by Herron:
[operations/puppet@production] assign codfw logstash ssd hosts role::insetup

https://gerrit.wikimedia.org/r/579461

Change 576967 abandoned by Herron:
elasticsearch: add max_clause_count setting

Reason:
going with I2e690d26e5bd4d9961f261eb049f33ef58ad2588 instead

https://gerrit.wikimedia.org/r/576967

herron updated the task description. Mar 30 2020, 3:39 PM
Krinkle added a subscriber: Krinkle. Edited Mar 31 2020, 6:29 PM

First impressions of the new Logstash/Kibana based on using Firefox 74 for macOS on an idle high-end MacBook Pro using a fast WiFi connection.

  • It is even slower to load. Just to have the UI appear initially at all now takes 7-8 seconds on logstash-next compared to ~1 second on logstash (this is while loading the domain and seeing the "Loading" animation).
  • All interface links and buttons are unresponsive. When hovering any link or button (e.g. on any dashboard the "Close", "Show dates", "Lucene" or "Refresh" buttons) they are without a pointer cursor for the first 1-2 seconds before they can be clicked.
    • This also applies to modal interfaces such as the "edit filter" overlay, and the date inputs.
    • This is actually really difficult to screw up in a modern browser, so I'm kind of impressed they managed to make the UI this bad.
  • As a silver lining, they seem to have finally fixed the autocomplete widget for "Edit filter". It no longer tries to preload all 90 days of Logstash indexes client-side and iterate over every unique field on every keystroke (which is what led to T189333). Instead, this data is now lazy-loaded in chunks and filtering is debounced properly, resulting in an input field that is now actually usable, in all browsers I tried. Yay!

The codfw cluster is currently yellow; from the allocation explain output I see a lot of "explanation" : "node does not match index setting [index.routing.allocation.require] filters [disktype:\"hdd\"]"

I acked the alerts since the notifications were turned off.

There was some work to rotate old indexes to spinning disks, but the cluster knew of no nodes with the "hdd" disktype attribute. It looks like the configuration was stale, and restarting logstash[2021-2022] allowed the indexes to be assigned.
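
The explain message above means the index carries an `index.routing.allocation.require` filter that no node known to the cluster satisfies. As a rough illustration of that matching rule, the sketch below checks which nodes satisfy a set of required attributes. It is a simplification that assumes exact, single-valued matches (no wildcards or comma-separated values). The node names and the `eligible_nodes` helper are hypothetical; in practice the inputs would come from something like `GET _cat/nodeattrs` and the index settings.

```python
def eligible_nodes(node_attrs, require):
    """Return the sorted names of nodes whose attributes satisfy every filter.

    node_attrs: {node_name: {attribute: value}}, e.g. from _cat/nodeattrs
    require:    {attribute: value}, e.g. {"disktype": "hdd"}
    Simplified: exact string equality only, no wildcard or multi-value support.
    """
    return sorted(
        name
        for name, attrs in node_attrs.items()
        if all(attrs.get(key) == value for key, value in require.items())
    )


# Hypothetical cluster state before the restart: no node advertises "hdd".
before = {"node-a": {"disktype": "ssd"}, "node-b": {}}
# After the restart, node-b picks up its configured attribute.
after = {"node-a": {"disktype": "ssd"}, "node-b": {"disktype": "hdd"}}

unassignable = eligible_nodes(before, {"disktype": "hdd"}) == []
assignable = eligible_nodes(after, {"disktype": "hdd"})
```

An empty result corresponds to the yellow state described above: shards that require `disktype: hdd` have nowhere to go until at least one node reports that attribute.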

brennen added a subscriber: brennen. May 4 2020, 5:31 PM

Since Elastic Stack 7.7 has been released, I think it'd make sense to upgrade to that before the switch; supposedly there have been improvements to memory usage!

fgiunchedi added a subtask: Restricted Task. Jun 22 2020, 9:36 AM

Change 609397 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: decom check_procs

https://gerrit.wikimedia.org/r/609397

Change 609397 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: decom check_procs

https://gerrit.wikimedia.org/r/609397

Change 610079 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add thirdparty/elastic78 component

https://gerrit.wikimedia.org/r/610079

Change 610079 merged by Herron:
[operations/puppet@production] add thirdparty/elastic78 component

https://gerrit.wikimedia.org/r/610079

Change 610135 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set v7 cluster to version 7.8

https://gerrit.wikimedia.org/r/610135

Change 610135 merged by Herron:
[operations/puppet@production] logstash: set v7 cluster to version 7.8

https://gerrit.wikimedia.org/r/610135

Mentioned in SAL (#wikimedia-operations) [2020-07-09T19:16:27Z] <herron> upgraded eqiad elk7 cluster from 7.4.2 to 7.8.0 T234854

I am getting a lot of 500 internal server errors on the logstash-next instance. I am guessing that is expected/WIP?