
Changing Kibana filters is ridiculously slow
Open, MediumPublic

Description

Every time I interact with the add/edit filter interface in Kibana (the bar below the search box), the browser freezes up for 4-5 seconds. It makes dashboard creation very annoying.

Event Timeline

Tgr created this task.Mar 9 2018, 8:39 PM
Restricted Application added a subscriber: Aklapper. Mar 9 2018, 8:39 PM

4-5 seconds sounds incredibly quick, what browser are you using? For me in Firefox 61, I tried clicking "add a filter", then after 20 seconds the caret appeared, then I typed "type" into the fields box, then I waited 2 minutes and still the browser was locked up and only the "t" had appeared. So it's effectively impossible to add filters.

It is under 5 secs for me as well. Firefox 60.0.2 on Ubuntu 18.04

Tgr added a comment.Jul 11 2018, 7:19 AM

Chrome (67.0.3396.99 on Ubuntu 16.04) in my case.

I'm not sure how it works internally, but it seems to me like Kibana composes its backend query from two things: the main search field (using Lucene query syntax), and the toggle-able filters.

Both seem equally able to filter all fields in all ways (AND, OR, NOT, etc.), but I mostly use the Lucene query syntax, even though it doesn't let me group and disable individual parts as easily (parentheses become unmanageable after a while). The reason is that the "Add filter" interface is indeed very slow.

The reason it is slow is that its input field for "field name" uses autocompletion, and we have a lot of garbage in our log message contexts that is polluting the index. See also T180051 and T150106.

For example:

Based on my eyeball estimate, I would say the vast majority are of the form events.params.usages.*#*.aspect, e.g. events.params.usages.Q19595336#C.P570.aspect. Seems like these would be more appropriate as values, not as indexed keys. Probably unintentional?
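For illustration (the shapes below are made up, not taken from the actual Wikibase code), the difference is between a context where the usage identifier becomes part of the key, and one where usages are plain values under a single fixed key:

  // Made-up shapes for illustration only (shown flattened, as Kibana sees them).
  // Before: every distinct usage creates a new indexed field name.
  var before = { 'events.params.usages.Q19595336#C.P570.aspect': '...' };
  // After: usages are values under one fixed key, so the set of field names stays constant.
  var after = { 'events.params.usages': [ 'Q19595336#C.P570' ] };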

It seems Kibana rebuilds this list every time you open the dropdown menu, and every time you edit or add a filter.

Restricted Application added a project: Wikidata. Oct 24 2018, 5:09 PM

So I made one change yesterday, https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseQualityConstraints/+/471752/, which should stop F26756076 and F25831802 when deployed next week.

I've yet to find what is being logged that causes F25831816; it could be that it is already gone?

Once the fixes are deployed, how easy is it to remove the keys from the index / filter list?

Addshore moved this task from incoming to needs discussion or investigation on the Wikidata board.
Addshore added a project: User-Addshore.
kostajh added a subscriber: kostajh.Nov 6 2018, 9:07 PM
Addshore triaged this task as Medium priority.Nov 7 2018, 2:37 PM

Still there today:


Based on my eyeballing of the scrollbar, I'd estimate these two groups together represent some 60% of all Logstash message keys.

@alaa_wmde This is still contributing to making Logstash nearly unusable (beyond short-term fire fighting).

I'm helping various teams get visibility into the production state of their software, but Logstash responsiveness is making that quite difficult for dashboards about specific services and teams, because all interaction with filters either times out (Safari), crashes the browser (Firefox), or takes 30s to a minute (Chrome) (based on sparse anecdotal evidence, YMMV).

Let me know if I can help in some way. I don't know Wikibase very well, but if the data in question isn't monitored in a Logstash query, it should be straightforward to omit from logging and/or reformat in a way that uses a finite number of keys.

Apparently using the logstash filters is too much for my travel laptop to handle.

Could someone check which keys appear a ridiculous number of times in the filters right now, so we can try to make some headway here!

Krinkle added a comment.EditedJun 24 2019, 11:55 PM

It's looking much better than last year: there are now only 7,665 unique log field names.

The biggest offenders are listed below, but it seems none of these are Wikibase's responsibility. Looks like those were all fixed.

  • 2000+ field names for err_headers_*** and err_headers_***.raw
err_headers_***
err_headers_accept-ranges
err_headers_accept-ranges.raw
…
err_headers_set-cookie_0
err_headers_set-cookie_0.raw
…
err_headers_set-cookie_27.raw
err_headers_set-cookie_28
err_headers_set-cookie_28.raw
err_headers_set-cookie_29
err_headers_set-cookie_29.raw
…
err_headers_x-news-cache-id
err_headers_x-news-cache-id.raw
…
err_headers_x-shopify-stage
err_headers_x-shopify-stage.raw
…
err_headers_z-user-agent
err_headers_z-user-agent.raw
  • 1335 field names for request-*** and request-***.raw
request-***
request
request-0
request-0.raw
request-1
request-1.raw
request-2
request-2.raw
request-3
request-3.raw
request-4
request-4.raw
request-5
request-5.raw
request-A-IM
request-A-Im
request-A-im
request-a-im
request-A-IM.raw
request-A-Im.raw
request-A-im.raw
request-a-im.raw
request-ACCEPT
request-Accept
request-accept
…
request-DNT
request-Dnt
request-dnt
request-DNT.raw
request-Dnt.raw
request-dnt.raw
…
request-X-Appengine-City
request-X-Appengine-City.raw
request-X-Appengine-Citylatlong
request-X-Appengine-Citylatlong.raw
…
request-X-FeedBurner-URI
request-X-FeedBurner-URI.raw
…
request-X-Newrelic-Id.raw
request-x-newrelic-id.raw
…
request-X-WMF-UUID
request-x-wmf-uuid
  • 750+ field names for err_body_*** and err_body_***.raw
err_body_***
err_body
err_body.raw
err_body_0
err_body_0.raw
err_body_1
err_body_1.raw
err_body_10
err_body_10.raw
err_body_100
err_body_100.raw
…
err_body_386.raw
err_body_387
err_body_387.raw
err_body_388
err_body_388.raw
err_body_389
err_body_389.raw
err_body_39
err_body_39.raw
err_body_390
err_body_390.raw
…
err_body_99
err_body_99.raw
  • 300+ field names for response-*** and response-***.raw
response-***
response
response-Accept-Ranges
response-accept-ranges
response-Accept-Ranges.raw
response-accept-ranges.raw
…
response-Docker-Distribution-Api-Version
response-Docker-Distribution-Api-Version.raw
…
response-MediaWiki-API-Error
response-MediaWiki-API-Error.raw
response-MediaWiki-Login-Suppressed
response-MediaWiki-Login-Suppressed.raw
…
response-Thumbor-Convert-Time
response-thumbor-convert-time
response-Thumbor-Convert-Time.raw
response-thumbor-convert-time.raw
response-Thumbor-Engine
response-thumbor-engine
response-Thumbor-Engine.raw
response-thumbor-engine.raw
…
response-Thumbor-Xcf2Png-Time
response-Thumbor-Xcf2Png-Time.raw
response-Thumbor-Xvfb-Run-Time
response-Thumbor-Xvfb-Run-Time.raw
…
response-vegaversion
response-vegaversion.raw
…
response-X-XSS-Protection
response-x-xss-protection
response-X-XSS-Protection.raw
response-x-xss-protection.raw
response-Xkey
response-xkey
response-Xkey.raw
response-xkey.raw
  • 50+ field names for res_headers_*** and res_headers_***.raw
res_headers_***
res_headers_access-control-allow-headers
res_headers_access-control-allow-headers.raw
…
res_headers_vegaversion
res_headers_vegaversion.raw
…
res_headers_x-webkit-csp
res_headers_x-webkit-csp.raw
res_headers_x-xss-protection
res_headers_x-xss-protection.raw
Krinkle added a subscriber: ema.Jul 2 2019, 3:17 PM

Per @ema, the request-{} and response-{} lines are due to T224994, which was recently enabled.

Change 520425 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishlog: request/response headers to send to logstash

https://gerrit.wikimedia.org/r/520425

ema added a comment.EditedJul 3 2019, 12:50 PM

I've prepared https://gerrit.wikimedia.org/r/520425 to restrict the request/response headers sent to logstash by the various varnishlogconsumer daemons.

That being said, it seems to me that the problem here is really that the kibana field name dropdown implementation does not scale. We can surely try to limit the number of fields we send, but it seems like a Sisyphean effort.

Restricted Application added a project: Operations. Jul 3 2019, 12:51 PM
ssastry removed a subscriber: ssastry.Jul 3 2019, 4:36 PM
ema moved this task from Triage to Watching on the Traffic board.Jul 5 2019, 11:59 AM

Change 520425 merged by Ema:
[operations/puppet@production] varnishlog: request/response headers to send to logstash

https://gerrit.wikimedia.org/r/520425

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields from three months ago. Currently at 7,665 table columns.

@ema @fgiunchedi Has the above change been deployed?

ema added a comment.Sep 11 2019, 9:14 AM

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields from three months ago. Currently at 7,665 table columns.
@ema @fgiunchedi Has the above change been deployed?

It has indeed, on August 21 2019.

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields from three months ago. Currently at 7,665 table columns.

That's indeed unexpected, can you share how you are doing the analysis/pulling the field names? I'd be interested myself as well.

In terms of fixing/mitigation, what @ema said is certainly true that for kibana to load all fields in a dropdown is a losing battle, what I'm currently thinking is:

  • Find the worst offenders (i.e. indexed fields like errors_1_message or arbitrary strings like response/request headers) and either change the service to log differently, or stop logstash from turning them into fields (similar to https://phabricator.wikimedia.org/T180051#4439980 from @EBernhardson). There doesn't seem to be a perfect solution, but whitelisting, say, the top 500-800 fields in logstash might work without too much disruption / unexpected results (see the sketch after this list).
  • Check what modern Kibana does and whether it will further mitigate the problem of having many fields indexed.
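As a rough sketch of the whitelisting idea above (placeholder field list, not an actual patch), something along the lines of the logstash prune filter could be used to drop everything that isn't explicitly listed before it ever becomes an indexed field:

  filter {
    prune {
      # Keep only explicitly whitelisted fields; this list is illustrative only.
      # A real whitelist would cover the few hundred fields dashboards actually use.
      whitelist_names => [ "^@timestamp$", "^host$", "^type$", "^level$", "^channel$", "^message$" ]
    }
  }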

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields from three months ago. Currently at 7,665 table columns.

That's indeed unexpected, can you share how you are doing the analysis/pulling the field names?

  1. Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
  2. Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
  3. Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)

This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (recursively aggregates the textual content of all child list items and concatenates it), and copies it to your clipboard.

Then, paste in a text editor and use some method of removing empty lines and count them :)
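For a quicker count, the same selector can be used to count the entries directly in the console, assuming the dropdown is fully rendered and not virtualized:

  // Counts the <li> entries under the same dropdown <ul> as above.
  $$('ul.uiSelectChoices--autoWidth.ui-select-dropdown li').length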

Krinkle added a comment.EditedThu, Nov 7, 6:15 PM

Using the above method, I now get 11,086 unique fields in the dropdown menu. That's significantly more than last month.

The fields we thought came from Varnish, and which the deployed patch was expected to remove, are still there, including:

request-X-Appengine-City
…
response-MediaWiki-API-Error
…
EBernhardson added a comment.EditedThu, Nov 7, 7:05 PM

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields from three months ago. Currently at 7,665 table columns.

That's indeed unexpected, can you share how you are doing the analysis/pulling the field names?

  1. Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
  2. Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
  3. Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)

This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (recursively aggregates the textual content of all child list items and concatenates it), and copies it to your clipboard.
Then, paste in a text editor and use some method of removing empty lines and count them :)

The more direct place to get this information is to click the Management (gear/cog) link in the sidebar and select Index Patterns. This will report all the fields kibana knows about, along with counts. Today it lists 11091 fields. I'm not sure when exactly this metadata updates, or if it's real time. The refresh button, which gives a big warning about resetting popularity counters, suggests it might not auto-update? We can compare to the actual indices with a bit of jq magic, but it would take a bit to work up.
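As a sketch of what that comparison could look like (assuming direct read access to the Elasticsearch cluster behind Logstash and a version that supports the field caps API; the host below is a placeholder):

  # Count the distinct field names known across all logstash-* indices.
  curl -s 'http://localhost:9200/logstash-*/_field_caps?fields=*' | jq '.fields | keys | length'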

CDanis added a subscriber: CDanis.Thu, Nov 7, 7:41 PM

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields from three months ago. Currently at 7,665 table columns.

That's indeed unexpected, can you share how you are doing the analysis/pulling the field names?

  1. Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
  2. Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
  3. Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)

This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (recursively aggregates the textual content of all child list items and concatenates it), and copies it to your clipboard.
Then, paste in a text editor and use some method of removing empty lines and count them :)

The more direct place to get this information is to click the Management (gear/cog) link in the sidebar and select Index Patterns. This will report all the fields kibana knows about, along with counts. Today it lists 11091 fields. I'm not sure when exactly this metadata updates, or if it's real time. The refresh button, which gives a big warning about resetting popularity counters, suggests it might not auto-update? We can compare to the actual indices with a bit of jq magic, but it would take a bit to work up.

IME it is safe to refresh counts despite the warnings, but yes, they don't auto-refresh. Thanks for the pointers on how to get the mappings; I did a bit more digging and filed a bunch of tasks for obvious offenders.

It looks like once the retention (max 90d) for the problematic indices kicks in, we should be in a better position, with more manageable fields.