Every time I interact with the add/edit filter interface in Kibana (the bar below the search box), the browser freezes up for 4-5 seconds. It makes dashboard creation very annoying.
Description
Details
| Subject | Repo | Branch | Lines +/- | |
|---|---|---|---|---|
| varnishlog: request/response headers to send to logstash | operations/puppet | production | +47 -14 |
Event Timeline
4-5 seconds sounds incredibly quick, what browser are you using? For me in Firefox 61, I tried clicking "add a filter", then after 20 seconds the caret appeared, then I typed "type" into the fields box, then I waited 2 minutes and still the browser was locked up and only the "t" had appeared. So it's effectively impossible to add filters.
I'm not sure how it works internally, but it seems to me like Kibana composes its backned query from two things: The main search field (using Lucene query syntax), and the toggle-able filters.
Both seem equally able to filter all fields in all ways (AND, OR, NOT, etc.). But I mostly use the Lucene query syntax despite not being able to group and disable individual parts as easily (parenthesis become unmanageable after a while). The reason being, that indeed the "Add filter" interface is very slow.
The reason it is slow is because it's input field for "field name" uses autocompletion and we have a lot of garbage in our log message contexts that is polluting the index. See also T180051 and T150106.
For example:
Based on my eye-ball estimate, I would say the vast majority are of the form events.params.usages.*#*.aspect, e.g. events.params.usages.Q19595336#C.P570.aspect. Seems like these would be more appropiate as values, not as indexed keys. Probably unintentional?
It seems Kibana rebuilds this list every time you open the dropdown menu, and for every time you edit or add a filter.
So I made one change yesterday, https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikibaseQualityConstraints/+/471752/ which should stop F26756076 and F25831802 when deployed next week.
I'm yet to find what is logged and causing F25831816, it could be that it is already gone?
Once the fixes are deployed how easy is it to remove then keys from the index / filter list?
Still there today:
Based on me eyeballing the scrollbar, I'd estimate these two groups together represent some 60% of all Logstash message keys.
@alaa_wmde This is still contributing to making Logstash nearly unusable (beyond short-term fire fighting).
I'm helping various teams get visibility into the production state of their software, but Logstash responsiveness is making that quite difficult to get to work for dashboards about specific services and teams, due to all interaction with filters either timing out (Safari), crashing the browser (Firefox) or taking 30s to a minute (Chrome). (based on sparse anecdotal evidence, YMMV).
Let me know if I can help in some way. I don't know Wikibase very well, but if the data in question isn't monitored in a Logstash query, it should be straight-forward to omit from logging and/or reformat in a way that uses a finite number of keys.
Apparently using the logstash filters is too much for my travel laptop to handle.
Could someone check to see which keys appear a ridiculous number of times in the filters right now and we can try to make some headway here!
It's looking much better than last year, there are now only 7,665 unique log field names.
The biggest offenders attached below, but it seems none of these are Wikibase's responsibility. Looks like those were all fixed.
- 2000+ field names for err_headers_*** and err_headers_***.raw
err_headers_accept-ranges err_headers_accept-ranges.raw … err_headers_set-cookie_0 err_headers_set-cookie_0.raw … err_headers_set-cookie_27.raw err_headers_set-cookie_28 err_headers_set-cookie_28.raw err_headers_set-cookie_29 err_headers_set-cookie_29.raw … err_headers_x-news-cache-id err_headers_x-news-cache-id.raw … err_headers_x-shopify-stage err_headers_x-shopify-stage.raw … err_headers_z-user-agent err_headers_z-user-agent.raw
- 1335 field names for request-*** and request-***.raw
request request-0 request-0.raw request-1 request-1.raw request-2 request-2.raw request-3 request-3.raw request-4 request-4.raw request-5 request-5.raw request-A-IM request-A-Im request-A-im request-a-im request-A-IM.raw request-A-Im.raw request-A-im.raw request-a-im.raw request-ACCEPT request-Accept request-accept … request-DNT request-Dnt request-dnt request-DNT.raw request-Dnt.raw request-dnt.raw … request-X-Appengine-City request-X-Appengine-City.raw request-X-Appengine-Citylatlong request-X-Appengine-Citylatlong.raw … request-X-FeedBurner-URI request-X-FeedBurner-URI.raw … request-X-Newrelic-Id.raw request-x-newrelic-id.raw … request-X-WMF-UUID request-x-wmf-uuid
- 750+ field names for err_body_***.raw and err_body_***.raw
err_body err_body.raw err_body_0 err_body_0.raw err_body_1 err_body_1.raw err_body_10 err_body_10.raw err_body_100 err_body_100.raw … err_body_386.raw err_body_387 err_body_387.raw err_body_388 err_body_388.raw err_body_389 err_body_389.raw err_body_39 err_body_39.raw err_body_390 err_body_390.raw … err_body_99 err_body_99.raw
- 300+ field names for response-*** and response-***.raw
response response-Accept-Ranges response-accept-ranges response-Accept-Ranges.raw response-accept-ranges.raw … response-Docker-Distribution-Api-Version response-Docker-Distribution-Api-Version.raw … response-MediaWiki-API-Error response-MediaWiki-API-Error.raw response-MediaWiki-Login-Suppressed response-MediaWiki-Login-Suppressed.raw … response-Thumbor-Convert-Time response-thumbor-convert-time response-Thumbor-Convert-Time.raw response-thumbor-convert-time.raw response-Thumbor-Engine response-thumbor-engine response-Thumbor-Engine.raw response-thumbor-engine.raw … response-Thumbor-Xcf2Png-Time response-Thumbor-Xcf2Png-Time.raw response-Thumbor-Xvfb-Run-Time response-Thumbor-Xvfb-Run-Time.raw … response-vegaversion response-vegaversion.raw … response-X-XSS-Protection response-x-xss-protection response-X-XSS-Protection.raw response-x-xss-protection.raw response-Xkey response-xkey response-Xkey.raw response-xkey.raw
- 50+ field names for res_headers_*** and res_headers_***.raw
res_headers_access-control-allow-headers res_headers_access-control-allow-headers.raw … res_headers_vegaversion res_headers_vegaversion.raw … res_headers_x-webkit-csp res_headers_x-webkit-csp.raw res_headers_x-xss-protection res_headers_x-xss-protection.raw
Change 520425 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishlog: request/response headers to send to logstash
I've prepared https://gerrit.wikimedia.org/r/520425 to restrict the request/response headers sent to logstash by the various varnishlogconsumer daemons.
That being said, it seems to me that the problem here is really that the kibana field name dropdown implementation does not scale. We can surely try to limit the number of fields we send, but it seems like a Sisyphean effort really.
Change 520425 merged by Ema:
[operations/puppet@production] varnishlog: request/response headers to send to logstash
I re-ran my analysis today, and oddly enough the total number of fields it not only similar but equal to the number of fields there were three months ago. Currently at 7,665 table columns.
@ema @fgiunchedi Has the above change been deployed?
That's indeed unexpected, can you share how you are doing the analysis/pulling the field names? I'd be interested myself as well.
In terms of fixing/mitigation, what @ema said is certainly true that for kibana to load all fields in a dropdown is a losing battle, what I'm currently thinking is:
- Find worst offenders (i.e. indexed fields like errors_1_message or arbitrary strings like response/request headers) and either change the service to log differently or ban/don't turn into fields from logstash (similar to https://phabricator.wikimedia.org/T180051#4439980 from @EBernhardson ). There doesn't seem a perfect solution but whitelisting say the top 500-800 fields in logstash might work without too much disruption / unexpected results
- Check what modern Kibana does and whether it'll mitigate further the problem of many fields indexed
- Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
- Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
- Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)
This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (recursively aggregates the textual content of all child list items and concatenates it), and copies it to your clipboard.
Then, paste in a text editor and use some method of removing empty lines and count them :)
Using the above method, I now get 11,086 unique fields in the dropdown menu. That's significantly more than last month.
The fields we thought that came from Varnish that the deployed patch is expected to remove, are still there, including:
request-X-Appengine-City … response-MediaWiki-API-Error …
The more direct place to get this information is to click the Management (gear/cog) link in sidebar and select Index Patterns. This will report all the fields kibana knows about, along with counts. Today it lists 11091 fields. I'm not sure when exactly this metadata updates, or if it's real time. The refresh button which gives a big warning about resetting popularity counters suggests to it might not auto-update? We can compare to the actual indices with a bit of jq magic, but would take a bit to work up.
IME it is safe to refresh counts despite the warnings, but yes they don't auto-refresh. Thanks for the pointers on how to get the mappings, I did a bit more digging and filed a bunch of tasks for obvious offenders.
It looks like once the retention (max 90d) for problematic indices kicks in we should be in a better position, with more manageable fields.
This is still an issue.
Editing Kibana dashboards:
- In Safari, crashes the tab.
- In Firefox, times out after 30 seconds, the user has to tell Firefox to "Wait for the slow script" and then wait another 20 seconds, in order to open the dropdown menu once. This cycle is repeated for every action with a dashboard filter.
- in Chrome, takes about 10-20 seconds, works most times.
I realize this isn't really the right solution but could you try https://logstash-next.wikimedia.org ? I tried there and "works on my computer". That's the elastic 7 stack (cfr T234854) although the upgrade is still in progress it should give us a good idea whether Kibana 7 "fixes" the problem, thanks!








