Upgrade ELK Stack to version 7
Open, Medium, Public

Description

Tracking task for upgrading the ELK stack to a more current stable release (targeting version 7.2)

High level items

  1. Build an ELK 7 upgrade environment in parallel to production
    • Provision ES 7 hosts (HW & OS)
    • Provision Logstash/Kibana 7 collector hosts (VM & OS)
    • Make new versions of ELK software installable via apt
    • Puppetize logging ES 7
    • Puppetize Logstash 7
    • Puppetize Kibana 7
    • Configure service address for load balanced Kibana frontend

  2. Determine legal viability of Amazon Open Distro for Elasticsearch; if viable:
    [] Integrate RBAC features with LDAP
    [] Puppetize management of security users, roles, mappings, etc.

  3. Ingest production logs
    • Determine best way to handle/manage logstash plugins in the new version & execute
    • Consume from kafka-logging
    • Determine best method to bridge gap for ingesting log sources not yet in Kafka
    • Validate log parsing, storage, etc.
    • Investigate and upgrade/adapt curator as necessary
    • Import Kibana configuration (saved searches, dashboards, visualizations, etc.)

  4. Determine if alerting features should be enabled; if so:
    [] Document guidelines for alerting functionality

  5. Overall validation and cut over
    • Provide access to new environment widely, with old env still available as a backup. (https://logstash-next.wikimedia.org)
      • Gather/address bugs identified during this period
    • Perform cut-over (name switch to logstash.wm.o)
  6. Migrate Kafka-logging brokers to ELK 7 cluster
  7. Fold (reimage/migrate) ELK 5 hardware into ELK7 cluster
  8. Retire ELK 5 VMs

Details

Project            Branch      Lines +/-
operations/puppet  production  +0 -187
operations/puppet  production  +20 -20
operations/puppet  production  +0 -2
operations/puppet  production  +2 -2
operations/dns     master      +0 -1
operations/puppet  production  +0 -38
operations/puppet  production  +0 -3
operations/puppet  production  +1 -1
operations/puppet  production  +8 -8
operations/puppet  production  +12 -0
operations/dns     master      +2 -0
operations/puppet  production  +0 -12
operations/puppet  production  +2 -2
operations/puppet  production  +14 -1
operations/puppet  production  +0 -12
operations/puppet  production  +11 -0
operations/puppet  production  +1 -1
operations/puppet  production  +115 -7
operations/puppet  production  +10 -0
operations/dns     master      +1 -0
operations/puppet  production  +24 -22
operations/puppet  production  +1 -1
operations/puppet  production  +51 -0
operations/puppet  production  +8 -0
operations/puppet  production  +13 -0
operations/puppet  production  +4 -2
operations/puppet  production  +78 -0
operations/dns     master      +12 -0
operations/puppet  production  +57 -0
operations/puppet  production  +4 -2
operations/puppet  production  +86 -74
operations/puppet  production  +0 -79
operations/puppet  production  +517 -2
operations/puppet  production  +1 -0
operations/puppet  production  +8 -3
operations/puppet  production  +3 -0
operations/puppet  production  +147 -0
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +27 -1
operations/puppet  production  +85 -0
operations/dns     master      +12 -0
operations/puppet  production  +525 -19
operations/puppet  production  +5 -5
operations/puppet  production  +6 -0
operations/puppet  production  +12 -0

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 571813 merged by Herron:
[operations/puppet@production] logstash: remove defalut value from kafka input type field

https://gerrit.wikimedia.org/r/571813

Change 571554 merged by Herron:
[operations/puppet@production] logstash::collector7 ingest deprecated logs from kafka

https://gerrit.wikimedia.org/r/571554

Change 574862 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add load balancing for kibana-next

https://gerrit.wikimedia.org/r/574862

Change 575320 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add profile::idp::client::httpd hiera for elk7 env

https://gerrit.wikimedia.org/r/575320

Change 575320 merged by Herron:
[operations/puppet@production] add profile::idp::client::httpd hiera for elk7 env

https://gerrit.wikimedia.org/r/575320

Change 574862 merged by Herron:
[operations/puppet@production] add load balancing for kibana-next

https://gerrit.wikimedia.org/r/574862

Change 575631 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] lvs: kibana-next: promote from "service_setup" to "lvs_setup"

https://gerrit.wikimedia.org/r/575631

Change 575631 merged by Herron:
[operations/puppet@production] lvs: kibana-next: promote from "service_setup" to "lvs_setup"

https://gerrit.wikimedia.org/r/575631

Change 576152 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add logstash-next.wikimedia.org record

https://gerrit.wikimedia.org/r/576152

Change 576151 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] cache: map logstash-next.wikimedia.org to kibana-next lvs

https://gerrit.wikimedia.org/r/576151

Change 576411 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add kibana-next SANs to kibana cert

https://gerrit.wikimedia.org/r/576411

Change 576411 merged by Herron:
[operations/puppet@production] add kibana-next SANs to kibana cert

https://gerrit.wikimedia.org/r/576411

Change 576152 abandoned by Herron:
dns: add logstash-next.wikimedia.org record

Reason:
abandoning in favor of a5257d4fc7826c26a6a7e60799b1c71fc789ed65

https://gerrit.wikimedia.org/r/576152

Change 576151 merged by Herron:
[operations/puppet@production] cache: map logstash-next.wikimedia.org and cas-logstash to kibana-next lvs

https://gerrit.wikimedia.org/r/576151

Change 576967 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elasticsearch: add max_clause_count setting

https://gerrit.wikimedia.org/r/576967

herron updated the task description. Mar 5 2020, 5:37 PM

Change 571622 merged by Herron:
[operations/puppet@production] logstash: add ES 7 compatible logstash template

https://gerrit.wikimedia.org/r/571622

Change 579461 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] assign codfw logstash ssd hosts role::insetup

https://gerrit.wikimedia.org/r/579461

Change 579461 merged by Herron:
[operations/puppet@production] assign codfw logstash ssd hosts role::insetup

https://gerrit.wikimedia.org/r/579461

Change 576967 abandoned by Herron:
elasticsearch: add max_clause_count setting

Reason:
going with I2e690d26e5bd4d9961f261eb049f33ef58ad2588 instead

https://gerrit.wikimedia.org/r/576967

herron updated the task description. Mar 30 2020, 3:39 PM
Krinkle added a subscriber: Krinkle. Edited Mar 31 2020, 6:29 PM

First impressions of the new Logstash/Kibana based on using Firefox 74 for macOS on an idle high-end MacBook Pro using a fast WiFi connection.

  • It is even slower to load. Just to have the UI appear initially at all now takes 7-8 seconds on logstash-next compared to ~1 second on logstash (this is while loading the domain and seeing the "Loading" animation).
  • All interface links and buttons are unresponsive. When hovering any link or button (e.g. on any dashboard the "Close", "Show dates", "Lucene" or "Refresh" buttons) they are without a pointer cursor for the first 1-2 seconds before they can be clicked.
    • This also applies to modal interfaces such as the "edit filter" overlay, and the date inputs.
    • This is actually really difficult to screw up in a modern browser, so I'm kind of impressed they managed to make the UI this bad.
  • As a silver lining, they seem to have finally fixed the autocomplete widget for "Edit filter". It no longer tries to preload all 90 days of Logstash indexes client-side and iterate over every unique field on every keystroke (which is what led to T189333). Instead, this data is now lazy-loaded in chunks and filtering is debounced properly, resulting in an input field that is now actually usable, in all browsers I tried. Yay!

The codfw cluster is currently yellow; from the allocation explain output I see a lot of "explanation" : "node does not match index setting [index.routing.allocation.require] filters [disktype:\"hdd\"]"

I acked the alerts since the notifications were turned off.

There was some work to rotate old indexes to spinning disks, but the cluster knew of no nodes with the "hdd" disktype attribute. It looks like the configuration was stale, and restarting logstash[2021-2022] allowed the indexes to be assigned.
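
For illustration, the allocation failure described above can be modelled roughly as follows. This is a minimal sketch, not Elasticsearch's actual implementation: it assumes exact-match semantics for the `index.routing.allocation.require.*` filters (ignoring wildcards and comma-separated values), and the node names and attribute maps are hypothetical.

```python
from typing import Dict, List

Attrs = Dict[str, str]


def node_matches_require(node_attrs: Attrs, require: Attrs) -> bool:
    """Return True when the node carries every attribute value the index requires."""
    return all(node_attrs.get(key) == value for key, value in require.items())


def eligible_nodes(nodes: Dict[str, Attrs], require: Attrs) -> List[str]:
    """List (sorted) the nodes that an index with these require filters could be assigned to."""
    return sorted(name for name, attrs in nodes.items()
                  if node_matches_require(attrs, require))


# Hypothetical index setting: old indexes must live on spinning disks.
require = {"disktype": "hdd"}

# Hypothetical cluster view before the restart: no node advertises disktype=hdd.
stale_view = {"node-a": {"disktype": "ssd"}, "node-b": {}}

# Hypothetical cluster view after the restart: node-b now reports its attribute.
fresh_view = {"node-a": {"disktype": "ssd"}, "node-b": {"disktype": "hdd"}}

print(eligible_nodes(stale_view, require))  # []
print(eligible_nodes(fresh_view, require))  # ['node-b']
```

With no eligible node, shards of the affected indexes stay unassigned, which is consistent with the yellow cluster state reported above.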

brennen added a subscriber: brennen. May 4 2020, 5:31 PM

Since Elastic Stack 7.7 has been released, I think it would make sense to upgrade to that before the switch; supposedly there have been improvements to memory usage!

fgiunchedi added a subtask: Restricted Task. Jun 22 2020, 9:36 AM

Change 609397 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] logstash: decom check_procs

https://gerrit.wikimedia.org/r/609397

Change 609397 merged by Filippo Giunchedi:
[operations/puppet@production] logstash: decom check_procs

https://gerrit.wikimedia.org/r/609397

Change 610079 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] add thirdparty/elastic78 component

https://gerrit.wikimedia.org/r/610079

Change 610079 merged by Herron:
[operations/puppet@production] add thirdparty/elastic78 component

https://gerrit.wikimedia.org/r/610079

Change 610135 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] logstash: set v7 cluster to version 7.8

https://gerrit.wikimedia.org/r/610135

Change 610135 merged by Herron:
[operations/puppet@production] logstash: set v7 cluster to version 7.8

https://gerrit.wikimedia.org/r/610135

Mentioned in SAL (#wikimedia-operations) [2020-07-09T19:16:27Z] <herron> upgraded eqiad elk7 cluster from 7.4.2 to 7.8.0 T234854

I am getting a lot of 500 internal server errors on logstash-next instance. I am guessing that is expected/WIP?

I am getting a lot of 500 internal server errors on logstash-next instance. I am guessing that is expected/WIP?

Not necessarily expected, but it is being tracked in T259219 and a fix has been put in place.

Mentioned in SAL (#wikimedia-operations) [2020-08-26T18:08:44Z] <herron> upgraded eqiad elk v7 cluster from 7.8.0 to 7.9.0 T234854

Krinkle renamed this task from "Upgrade ELK Stack" to "Upgrade ELK Stack to version 7". Aug 26 2020, 6:16 PM

Another thing I noticed is that dashboards don't seem to do proper filtering. Do dashboards have to be fully redone for 7? I am assuming dashboard definitions have been automatically imported.

If I compare:

with the original

We can see that on the original, channel DBQuery logs are rare (queries rarely fail unless a very bad thing is happening). However, at the time of this comment, I am getting 1 result on the old install and tens of thousands on the new one. Something's wrong, because the host list includes hosts like "kubernetes2013" and "gerrit1001", which wouldn't be doing mediawiki queries. Are the original dashboards wrong? Did the syntax to create them change, or is there another reason? This is not the only dashboard affected (do all dashboards need to be redone?). I can create a task if you believe this is a real issue.

colewhite added a comment. Edited Sep 7 2020, 1:53 PM

@jcrespo Thanks for bringing this to our attention. The filters on that dashboard indicate they are broken because the filter pattern logstash-* cannot be found on logstash-next.

Trying to load: https://logstash-next.wikimedia.org/app/dashboards#/view/6bcd2a10-7d21-11e7-86fb-51c84229aeb7

My laptop fan starts spinning very hard, everything times out and I get this popup:

Hmm, I'm not able to reproduce this consistently. On occasion the slow query popup does appear, but I have not been able to produce any full timeouts. Is this occurring for you with a 15m time selection?

herron updated the task description. Oct 28 2020, 7:44 PM

I was able to load it successfully this time on both Chrome and Firefox, it's very slow though. This is all with the default 15min.

All the filters have a red "error" after them.

Surprisingly, the content of the dashboard visualizations is different between the 2 instances (looks like different sources; -next is about mediawiki, not networks).

Is there an easy to read changelog for end-users like me?

@jcrespo Thanks for bringing this to our attention. The filters on that dashboard indicate they are broken because the filter pattern logstash-* cannot be found on logstash-next.

Could you elaborate on this? Does this mean "don't worry, it will be right when the cut-over happens" or "you should do X actionable beforehand to fix it"? Sorry, I don't have a lot of experience with Kibana internals, but this dashboard is important to us. CC @Marostegui

First impressions of the new Logstash/Kibana based on using Firefox 74 for macOS on an idle high-end MacBook Pro using a fast WiFi connection.

  • It is even slower to load. Just to have the UI appear initially at all now takes 7-8 seconds on logstash-next compared to ~1 second on logstash (this is while loading the domain and seeing the "Loading" animation).
  • All interface links and buttons are unresponsive. When hovering any link or button (e.g. on any dashboard the "Close", "Show dates", "Lucene" or "Refresh" buttons) they are without a pointer cursor for the first 1-2 seconds before they can be clicked.
    • This also applies to modal interfaces such as the "edit filter" overlay, and the date inputs.
    • This is actually really difficult to screw up in a modern browser, so I'm kind of impressed they managed to make the UI this bad.
  • As a silver lining, they seem to have finally fixed the autocomplete widget for "Edit filter". It no longer tries to preload all 90 days of Logstash indexes client-side and iterate over every unique field on every keystroke (which is what led to T189333). […]

The above is still the case. I think to the extent possible we should remain on Kibana 6 until and unless these are addressed by upstream, or until we invest in a basic replacement that can do the minimum visualisations and permalink access we actually want/need in a way that doesn't routinely crash browsers or cost an hour to perform a simple triage task.

colewhite added a comment. Edited Nov 4 2020, 8:27 PM

Could you elaborate on this? Does this mean "don't worry, it will be right when the cut-over happens" or "you should do X actionable beforehand to fix it"? Sorry, I don't have a lot of experience with Kibana internals, but this dashboard is important to us. CC @Marostegui

@jcrespo This issue should be resolved at this point as I now see the logstash-* filter pattern on logstash-next. Please let us know if it is still broken.

  • It is even slower to load. Just to have the UI appear initially at all now takes 7-8 seconds on logstash-next compared to ~1 second on logstash (this is while loading the domain and seeing the "Loading" animation).

I can confirm the UI itself has a longer preload time than Kibana 5. Wall clock from login to rendered home dashboard:

Kibana 7: ~10s
Kibana 5: ~7s

Timing the dashboards implies there is a difference in how Visualizations are rendered and the render is indeed much slower. The "mediawiki-errors" dashboard for example:

Kibana 7: ~26s
Kibana 5: ~13s
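
To put the measurements above in relative terms, here is a small sketch that computes the slowdown factor for each case. The figures are the approximate wall-clock timings quoted above; the dictionary layout and the `slowdown` helper are only illustrative.

```python
# Approximate wall-clock timings in seconds, as quoted above: (Kibana 7, Kibana 5).
timings = {
    "home dashboard": (10.0, 7.0),
    "mediawiki-errors dashboard": (26.0, 13.0),
}


def slowdown(kibana7_seconds: float, kibana5_seconds: float) -> float:
    """Return how many times slower Kibana 7 is than Kibana 5."""
    return kibana7_seconds / kibana5_seconds


for name, (k7, k5) in timings.items():
    print(f"{name}: {slowdown(k7, k5):.2f}x slower on Kibana 7")
```

On these numbers the home dashboard is roughly 1.4 times slower and the "mediawiki-errors" dashboard is 2 times slower.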

7.10 should drop any day now. It's worth seeing what dropping the call to calculateObjectHash() will do for Visualization performance.

@jcrespo This issue should be resolved at this point as I now see the logstash-* filter pattern on logstash-next. Please let us know if it is still broken.

If you click on the links I sent at T234854#6439791 you can see that I get (as of this writing) no errors on DBQuery on the original dashboard and several hundred thousand messages on "next", as filtering is not working. I believe it is displaying all logs ("Disabled Index pattern logstash-* not found"). This error is happening on all DB* dashboards: https://logstash-next.wikimedia.org/app/dashboards Maybe I am doing something wrong? Please advise.

If you click on the links I sent at T234854#6439791 you can see that I get (as of this writing) no errors on DBQuery on the original dashboard and several hundred thousand messages on "next", as filtering is not working. I believe it is displaying all logs ("Disabled Index pattern logstash-* not found"). This error is happening on all DB* dashboards: https://logstash-next.wikimedia.org/app/dashboards Maybe I am doing something wrong? Please advise.

Thank you for following up on this issue. It seems there is a problem in the migration somewhere. I have amended the dashboards for now; please have a look.

If they are working as expected, we see the problem and will ensure it is fixed prior to the cut-over.

I have amended the dashboards for now; please have a look.

Thanks, they work now.

As a minor issue, one thing I noticed is that page load takes 3 seconds on the old one but 26 seconds on the new one. This is on the "DBConnection" dashboard (which just filters by type and channel), at a moment where only 2 events are loaded, so it shouldn't be a memory/browser issue. Could it be that some indexes are missing on the new setup, or something else?

As a minor issue, one thing I noticed is that page load takes 3 seconds on the old one but 26 seconds on the new one. This is on the "DBConnection" dashboard (which just filters by type and channel), at a moment where only 2 events are loaded, so it shouldn't be a memory/browser issue. Could it be that some indexes are missing on the new setup, or something else?

Unfortunately, it is the browser causing the delay. Kibana 7.10 is purported to have the fix that should eliminate this unnecessary client-side work.

Good news! Thank you for the work!

herron added a comment. Nov 9 2020, 5:55 PM

I think to the extent possible we should remain on Kibana 6 until and unless these are addressed by upstream

Thanks for the feedback, it is much appreciated. Observability discussed this concern as a team this morning, and yes we are in agreement that we should hold off until addressed upstream. We'll postpone cutover, and will be contacting elastic to get a better understanding of when 7.10 will be released, as a related fix was merged back in Sept (https://github.com/elastic/kibana/pull/77646) but it is not yet clear when updated packages will be made available.

Kibana 7.10 is live. A quick check indicates performance is back to parity with Kibana 5.

Home dashboard: ~7s
The "mediawiki-errors" dashboard: ~13s

Krinkle added a comment. Edited Nov 27 2020, 3:10 AM

Kibana 7.10 is live. A quick check indicates performance is back to parity with Kibana 5.

Home dashboard: ~7s
The "mediawiki-errors" dashboard: ~13s

This might be unrepresentative, as it seems none of the filters currently work on Kibana 7.10.

On the various dashboards I checked, all of them have Error for their filter bubbles. This wasn't obvious to me at first since at least one of them actually uses the word "Error" as the value, so it saying "channel: Error" didn't immediately look out of place. But it is actually telling us that the search queries have failed:

When attempting to edit a few of these, it seems they have all turned into blank boxes with a rejected JSON blob as custom DSL (even the ones authored as regular UI-made filter bubbles on Kibana 5).

As such, it seems all the dashboards are querying all the source data without any filters (all hosts, all applications/services/programs etc.). The mediawiki-errors dashboard, for example, includes apache2 messages from phab1001, and syslog messages from cp4031, restbase, wdqs2006, etc.

Is there some sort of migration script that exists and needs to be run? Or was there a breaking change in the release that removed support for filters from dashboards on previous versions?

It might also have something to do with the logstash-* index names, which seems to be the first field that shows an error when editing those filter bubbles, in which case it might just be a configuration issue.

@Krinkle That looks very similar to the problems I found initially on DB* dashboards, and then they did something to fix it- people here will know more. This was my initial, no longer happening report: T234854#6439791

For additional feedback on my side, I have been using logstash-next for the most part in the last 2 weeks and I am happy with it so far, after the couple of bumps I ran into were solved. Thank you! :-) This stresses the need for beta-testing. Thank you again!

Change 654294 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kibana: change backend naming from kibana-next to kibana7

https://gerrit.wikimedia.org/r/654294

Change 654436 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kibana7: add kibana7 conftool entries

https://gerrit.wikimedia.org/r/654436

Change 654437 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kibana7: repoint (rename) kibana-next services to kibana7

https://gerrit.wikimedia.org/r/654437

Change 654438 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kibana7: remove kibana-next conftool entries

https://gerrit.wikimedia.org/r/654438

herron added a comment. Edited Jan 6 2021, 5:43 PM

It might also have something to do with the logstash-* index names, which seems to be the first field that shows an error when editing those filter bubbles, in which case it might just be a configuration issue.

Yes, this looks to have been largely the issue. Essentially the existing filters were configured for an index pattern with pattern id logstash-*. At some point along the way the index pattern named logstash-* was re-created in logstash-next, where the default behavior is to assign a random index pattern id like acba6310-f6d3-11ea-b848-090a7444f26c. After that happened, some dashboards were updated to use the new index pattern id, and some weren't.

What I've done to address this is re-create the index pattern named logstash-*, and manually specify an index pattern id of logstash-*. I've also created an index pattern named logstash* with pattern id acba6310-f6d3-11ea-b848-090a7444f26c.

There will be another import from ELK5 to ELK7 in the next few days, and in that process we should be able to converge on a single index pattern. And along with that dashboards will be checked individually for this problem (and general functionality) as well.
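
The index pattern id mismatch described above can be sketched as follows. This is a simplified illustration, not the procedure that was actually used: it assumes a Kibana 7 style saved object that carries a `references` list of `{name, type, id}` entries, and the `repoint_index_pattern` helper and the example dashboard object are hypothetical.

```python
import copy
from typing import Any, Dict

SavedObject = Dict[str, Any]


def repoint_index_pattern(saved_object: SavedObject, old_id: str, new_id: str) -> SavedObject:
    """Return a copy of the saved object whose index-pattern references
    pointing at old_id are rewritten to point at new_id."""
    updated = copy.deepcopy(saved_object)
    for ref in updated.get("references", []):
        if ref.get("type") == "index-pattern" and ref.get("id") == old_id:
            ref["id"] = new_id
    return updated


# Hypothetical exported object that was updated to the randomly assigned id.
dashboard = {
    "type": "dashboard",
    "id": "example-dashboard",
    "references": [
        {
            "name": "kibanaSavedObjectMeta.searchSourceJSON.filter[0].meta.index",
            "type": "index-pattern",
            "id": "acba6310-f6d3-11ea-b848-090a7444f26c",
        },
    ],
}

# Converge on the single pattern id "logstash-*".
fixed = repoint_index_pattern(
    dashboard, "acba6310-f6d3-11ea-b848-090a7444f26c", "logstash-*"
)
print(fixed["references"][0]["id"])  # logstash-*
```

A filter whose reference points at an id that no longer exists cannot be resolved, which matches the "Index pattern logstash-* not found" errors reported earlier in this thread.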

Change 655696 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: add kibana7.svc record

https://gerrit.wikimedia.org/r/655696

Change 655696 merged by Herron:
[operations/dns@master] dns: add kibana7.svc record

https://gerrit.wikimedia.org/r/655696

Change 654436 merged by Herron:
[operations/puppet@production] kibana7: add kibana7 conftool entries

https://gerrit.wikimedia.org/r/654436

ayounsi removed a subscriber: ayounsi. Jan 12 2021, 4:36 PM

Change 654437 merged by Herron:
[operations/puppet@production] kibana7: repoint (rename) kibana-next services to kibana7

https://gerrit.wikimedia.org/r/654437

Change 655754 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] ELK: promote logstash-next to logstash.wikimedia.org

https://gerrit.wikimedia.org/r/655754

Change 655802 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] kibana7: change vhost from logstash-next to logstash.wikimedia.org

https://gerrit.wikimedia.org/r/655802

Mentioned in SAL (#wikimedia-operations) [2021-01-13T17:11:20Z] <herron> beginning cutover of https://logstash.wikimedia.org frontend to ELK7 T234854

Change 655802 merged by Herron:
[operations/puppet@production] kibana7: change vhost from logstash-next to logstash.wikimedia.org

https://gerrit.wikimedia.org/r/655802

Change 655754 merged by Herron:
[operations/puppet@production] ELK: promote logstash-next to logstash.wikimedia.org

https://gerrit.wikimedia.org/r/655754

Change 655951 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elk7: enable icinga notifications

https://gerrit.wikimedia.org/r/655951

Change 655951 merged by Herron:
[operations/puppet@production] elk7: enable icinga notifications

https://gerrit.wikimedia.org/r/655951

Change 655957 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elk7: change kibana7 monitoring to critical

https://gerrit.wikimedia.org/r/655957

Change 655958 had a related patch set uploaded (by Herron; owner: Herron):
[operations/puppet@production] elk7: remove logstash-next cache setting

https://gerrit.wikimedia.org/r/655958

Change 655959 had a related patch set uploaded (by Herron; owner: Herron):
[operations/dns@master] dns: remove logstash-next.wikimedia.org record

https://gerrit.wikimedia.org/r/655959

herron updated the task description. Jan 14 2021, 1:58 AM

Change 654294 abandoned by Herron:
[operations/puppet@production] kibana: change backend naming from kibana-next to kibana7

Reason:
replaced by patches beginning with I40047b8824b9d44bf30d29bace4d7fd276d18e62

https://gerrit.wikimedia.org/r/654294

Change 663697 had a related patch set uploaded (by Cwhite; owner: Cwhite):
[operations/puppet@production] profile: remove logstash inputs on legacy cluster

https://gerrit.wikimedia.org/r/663697

Change 663697 merged by Cwhite:
[operations/puppet@production] profile: remove logstash inputs on legacy cluster

https://gerrit.wikimedia.org/r/663697

Mentioned in SAL (#wikimedia-operations) [2021-03-01T21:30:52Z] <shdubsh> completed removal of kafka logging inputs to legacy logstash cluster - T234854