Upgrade ELK Stack
Open, Medium, Public

Description

Tracking task for upgrading the ELK stack to a more current stable release (targeting version 7.2)

High level items

  1. Build an ELK 7 upgrade environment in parallel to production
    • Provision ES 7 hosts (HW & OS)
    • Provision Logstash/Kibana 7 collector hosts (VM & OS)
    • Make new versions of ELK software installable via apt
    • Puppetize logging ES 7
    • Puppetize Logstash 7
    • Puppetize Kibana 7
    • Configure service address for load balanced Kibana frontend
  2. Determine legal viability of Amazon's Open Distro for Elasticsearch; if viable:
    • Integrate RBAC features with LDAP
    • Puppetize management of security users, roles, mappings, etc.
  3. Ingest production logs
    • Verify APIFeatureUsage / cirrus cluster compatibility
    • Determine best way to handle/manage logstash plugins in the new version & execute
    • Consume from kafka-logging
    • Determine best method to bridge the gap for ingesting log sources not yet in Kafka
    • Validate log parsing, storage, etc.
    • Investigate and upgrade/adapt curator as necessary
    • Import Kibana configuration (saved searches, dashboards, visualizations, etc.)
  4. Determine if alerting features should be enabled; if so:
    • Document guidelines for alerting functionality
  5. Overall validation and cut over
    • Provide access to new environment widely, with old env still available as a backup.
      • Gather/address bugs identified during this period (to be expanded as we gain better understanding/experience here)
    • Determine best cut-over method & execute
  6. Migrate Kafka-logging brokers to ELK 7 cluster
  7. Fold (reimage/migrate) ELK 5 hardware into ELK7 cluster
  8. Retire ELK 5 VMs

Details

Related Gerrit Patches:
[operations/dns@master] dns: add kibana-next and logstash-next service addresses
[operations/puppet@production] lvs: add entries for logstash-next and kibana-next
[operations/puppet@production] logstash: set kafka consumer groups at the role level
[operations/puppet@production] logstash: add kafka ssl_endpoint_identification_algorithm param
[operations/puppet@production] logstash: remove non-kafka inputs from elk7 cluster
[operations/puppet@production] logstash: create elk7 logstash collector profile
[operations/puppet@production] logstash: set es config_version to 7 on elk7 hosts
[operations/puppet@production] kibana: add kibana_package param and set elk7 hosts to -oss variant
[operations/puppet@production] logstash: add logstash_package param and set elk7 to -oss variant
[operations/puppet@production] logstash: create elk7 logstash role and assign to elk7 collectors
[operations/puppet@production] logstash: set elk7 es heap_memory to 24G
[operations/puppet@production] logstash: set elk7 es config_version to 7
[operations/puppet@production] elasticsearch: add buster openjdk 8 repository
[operations/puppet@production] logstash: create elk7 ES role and assign to elk7 ES hw hosts
[operations/dns@master] add forwad/reverse entries for logstash 7 collector hosts
[operations/puppet@production] Introduce Elastic 7 support
[operations/puppet@production] aptrepo: include minor version in elastic 7 repos
[operations/puppet@production] install_server: use Buster for elastic 7 cluster
[operations/puppet@production] aptrepo: add elastic 7

Event Timeline

herron triaged this task as Medium priority. · Oct 7 2019, 8:05 PM
herron created this task.
Restricted Application added a subscriber: Aklapper. · Oct 7 2019, 8:05 PM
Paladox added a subscriber: Paladox. · Oct 7 2019, 8:09 PM
herron updated the task description. · Oct 8 2019, 3:14 PM
herron updated the task description. · Oct 10 2019, 4:29 PM

Change 545786 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] aptrepo: add elastic 7

https://gerrit.wikimedia.org/r/545786

Gehel added a subscriber: Gehel. · Oct 24 2019, 1:32 PM

Note that APIFeatureUsage has the ELK cluster talk to the Cirrus elasticsearch cluster directly. This means that logstash version on ELK needs to be compatible with the elasticsearch version on Cirrus. There is no mention of APIFeatureUsage in the checklist above, but maybe it should be added (we've been bitten before by that one).

Note that there is a task to remove that dependency (T217742), but no actual work has been done yet.

Change 545867 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] Introduce Elastic 7 support

https://gerrit.wikimedia.org/r/545867

fgiunchedi updated the task description. · Oct 24 2019, 2:51 PM

Thanks for the note! I've added cirrus compatibility to the checklists now. From a quick glance it _seems_ we should be fine, as logstash 7.4 is compatible with elastic 6.8 (cirrus is running 6.5.4).

List of breaking changes for 6.[678]:

Change 545786 merged by Filippo Giunchedi:
[operations/puppet@production] aptrepo: add elastic 7

https://gerrit.wikimedia.org/r/545786

Change 546876 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] install_server: use Buster for elastic 7 cluster

https://gerrit.wikimedia.org/r/546876

Change 546876 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use Buster for elastic 7 cluster

https://gerrit.wikimedia.org/r/546876

fgiunchedi updated the task description. · Oct 29 2019, 2:32 PM

Change 547161 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] aptrepo: include minor version in elastic 7 repos

https://gerrit.wikimedia.org/r/547161

Change 547161 merged by Filippo Giunchedi:
[operations/puppet@production] aptrepo: include minor version in elastic 7 repos

https://gerrit.wikimedia.org/r/547161

Change 545867 merged by Filippo Giunchedi:
[operations/puppet@production] Introduce Elastic 7 support

https://gerrit.wikimedia.org/r/545867

Change 552567 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] add forwad/reverse entries for logstash 7 collector hosts

https://gerrit.wikimedia.org/r/552567

Change 552567 merged by Herron:
[operations/dns@master] add forwad/reverse entries for logstash 7 collector hosts

https://gerrit.wikimedia.org/r/552567

Change 552837 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: create elk7 ES role and assign to elk7 ES hw hosts

https://gerrit.wikimedia.org/r/552837

Change 552881 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: create elk7 logstash role and assign to elk7 collectors

https://gerrit.wikimedia.org/r/552881

Change 552837 merged by Herron:
[operations/puppet@production] logstash: create elk7 ES role and assign to elk7 ES hw hosts

https://gerrit.wikimedia.org/r/552837

Change 554095 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add buster openjdk 8 repository

https://gerrit.wikimedia.org/r/554095

Change 554095 merged by Herron:
[operations/puppet@production] elasticsearch: add buster openjdk 8 repository

https://gerrit.wikimedia.org/r/554095

Change 554101 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set elk7 es config_version to 7

https://gerrit.wikimedia.org/r/554101

Change 554101 merged by Herron:
[operations/puppet@production] logstash: set elk7 es config_version to 7

https://gerrit.wikimedia.org/r/554101

Change 554103 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set elk7 es heap_memory to 24G

https://gerrit.wikimedia.org/r/554103

Change 554103 merged by Herron:
[operations/puppet@production] logstash: set elk7 es heap_memory to 24G

https://gerrit.wikimedia.org/r/554103

Change 552881 merged by Herron:
[operations/puppet@production] logstash: create elk7 logstash role and assign to elk7 collectors

https://gerrit.wikimedia.org/r/552881

Change 554152 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add logstash_package param and set elk7 to -oss variant

https://gerrit.wikimedia.org/r/554152

Change 554152 merged by Herron:
[operations/puppet@production] logstash: add logstash_package param and set elk7 to -oss variant

https://gerrit.wikimedia.org/r/554152

Change 554157 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kibana: add kibana_package param and set elk7 hosts to -oss variant

https://gerrit.wikimedia.org/r/554157

Change 554157 merged by Herron:
[operations/puppet@production] kibana: add kibana_package param and set elk7 hosts to -oss variant

https://gerrit.wikimedia.org/r/554157

Change 554160 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set es config_version to 7 on elk7 hosts

https://gerrit.wikimedia.org/r/554160

Change 554160 merged by Herron:
[operations/puppet@production] logstash: set es config_version to 7 on elk7 hosts

https://gerrit.wikimedia.org/r/554160

Hello! I took the liberty to ack a lot of criticals/unknowns in icinga that were related to these new hosts, IIUC these are not in production :)

Thanks! I've downtimed the new hosts and their services until Thursday.

Change 554314 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: create elk7 logstash collector profile

https://gerrit.wikimedia.org/r/554314

Change 554314 merged by Herron:
[operations/puppet@production] logstash: create elk7 logstash collector profile

https://gerrit.wikimedia.org/r/554314

Change 554355 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: remove non-kafka inputs from elk7 cluster

https://gerrit.wikimedia.org/r/554355

Change 554355 merged by Herron:
[operations/puppet@production] logstash: remove non-kafka inputs from elk7 cluster

https://gerrit.wikimedia.org/r/554355

Change 554362 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: add kafka ssl_endpoint_identification_algorithm param

https://gerrit.wikimedia.org/r/554362

Change 554362 merged by Herron:
[operations/puppet@production] logstash: add kafka ssl_endpoint_identification_algorithm param

https://gerrit.wikimedia.org/r/554362

Change 554472 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: set kafka consumer groups at the role level

https://gerrit.wikimedia.org/r/554472

Change 554472 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: set kafka consumer groups at the role level

https://gerrit.wikimedia.org/r/554472

herron updated the task description. · Dec 4 2019, 3:35 PM

Mentioned in SAL (#wikimedia-operations) [2019-12-05T08:03:52Z] <elukey> remove logstash_cleanup_indices_apifeatureusage-search.svc.codfw.wmnet and logstash_cleanup_indices_apifeatureusage-search.svc.eqiad.wmnet from logstash1025,logstash1024,logstash1023,logstash2024,logstash2025 to reduce cronspam - T234854

Hello :)

From cronspam I can see two errors that happen daily:

Cron[logstash_cleanup_indices_logstash]

elasticsearch.exceptions.ElasticsearchException: Unable to create client connection to Elasticsearch.  Error: Elasticsearch version 7.4.2 incompatible with this version of Curator (5.2.0)

Cron[logstash_cleanup_indices_apifeatureusage-search.svc.codfw.wmnet]
Cron[logstash_cleanup_indices_apifeatureusage-search.svc.eqiad.wmnet]

Error: Invalid value for "--config": Path "/etc/curator/config-apifeatureusage-search.svc.codfw.wmnet.yaml" does not exist.

The latter seems fixed: I removed the crontab entries (I didn't find anything in puppet related to them on the hosts sending cronspam) and ran puppet to make sure that those crons were not in the catalog. The former still needs to be checked/reviewed :)
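For the "config does not exist" class of cronspam, a small check can list crontab entries that point at missing Curator config files before they start mailing errors. The following is a minimal sketch, not what was run on these hosts: the function name `missing_configs` and the exact shape of the cron command lines are assumptions; only the `--config` option itself comes from the error output above.

```python
import os
import re

# Matches "--config PATH" or "--config=PATH" inside a crontab command line.
CONFIG_FLAG = re.compile(r"--config[=\s]+(\S+)")


def missing_configs(crontab_text, exists=os.path.exists):
    """Return the --config paths referenced by cron entries that do not exist.

    `exists` is injectable so the check can be exercised without touching
    the real filesystem; by default it uses os.path.exists.
    """
    missing = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and crontab comments.
        if not stripped or stripped.startswith("#"):
            continue
        for path in CONFIG_FLAG.findall(stripped):
            path = path.strip("'\"")
            if not exists(path) and path not in missing:
                missing.append(path)
    return missing
```

Feeding it the output of `crontab -l` on a collector host would return the stale paths, which can then be compared against what puppet is expected to manage.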

Change 554905 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] lvs: add entries for logstash-next and kibana-next

https://gerrit.wikimedia.org/r/554905

Change 554906 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add kibana-next and logstash-next service addresses

https://gerrit.wikimedia.org/r/554906

Change 554905 merged by Herron:
[operations/puppet@production] lvs: add entries for logstash-next and kibana-next

https://gerrit.wikimedia.org/r/554905

Change 554906 merged by Herron:
[operations/dns@master] dns: add kibana-next and logstash-next service addresses

https://gerrit.wikimedia.org/r/554906

@herron hello :) Any comment on what I wrote above about cronspam?

@elukey hey, yes that's been fixed by making a newer version of curator available to the new clusters. Haven't seen cron errors from these since Dec 5. Thanks for cleaning up the "config does not exist" entries!
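The Curator failure reported earlier in this task is the result of a client-side version range check. A minimal sketch of that kind of check follows, assuming an inclusive minimum/maximum range; the function names and the bounds used in the usage note are hypothetical placeholders, not the actual limits shipped with Curator 5.2.0.

```python
def parse_version(text):
    """Turn a dotted version string such as '7.4.2' into (7, 4, 2)."""
    return tuple(int(part) for part in text.split("."))


def check_compatible(es_version, tool_version, minimum, maximum):
    """Raise RuntimeError if es_version is outside [minimum, maximum].

    The bounds are supplied by the caller; they are illustrative only and
    are not taken from any Curator release.
    """
    current = parse_version(es_version)
    if not (parse_version(minimum) <= current <= parse_version(maximum)):
        raise RuntimeError(
            f"Elasticsearch version {es_version} incompatible with "
            f"this version of Curator ({tool_version})"
        )
```

With placeholder bounds of "5.0.0" and "5.99.99", `check_compatible("7.4.2", "5.2.0", "5.0.0", "5.99.99")` raises, while a 5.x cluster version passes. This is why the fix was a newer Curator on the new clusters rather than a configuration change.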