
figure out why Kafka dashboard hammers Prometheus, and fix it
Closed, Resolved · Public

Description

(part of https://wikitech.wikimedia.org/wiki/Incident_documentation/20190425-prometheus)

NB: It would probably be unwise to do extensive testing on modifications here before T222105 is completed

Loading too much history on the Kafka dashboard gives Prometheus an OOM-flavored stomachache. Whatever this dashboard is doing, it should do a lot less of it. (Might also involve writing some recording rules?)

Event Timeline

+1! I'm expecting the most effective mitigation to be recording rules, followed by loading fewer panels.
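
For reference, a recording rule along these lines would move the heavy aggregation to rule-evaluation time, so panels only read the small precomputed series. The rule name matches the one referenced further down in this task, but the group name, expression, and label grouping below are assumptions rather than the deployed configuration:

```
groups:
  - name: node_cpu
    rules:
      # Precompute the per-instance, per-mode CPU rate once per evaluation interval,
      # instead of irate()ing every raw node_cpu series at dashboard render time.
      - record: instance_mode:cpu:rate5m
        expr: avg by (cluster, instance, mode) (irate(node_cpu[5m]))
```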

I'm pretty sure it is these panels that are responsible for most of the Prometheus load:

[attachment: image.png, screenshot of the slow-loading panels]

They take much longer to load than the rest of the panels, and some of them errored out with the new settings.

The common thread here (along with a few other long-running panels off-screen) seems to be patterns like node_memory_Cached_bytes{cluster="$cluster",instance=~"$kafka_broker:.*"} where we are doing a regex match with grafana variable substitution against metric names that are exported by the entire fleet. (In the default case, this will evaluate to node_memory_Cached_bytes{cluster="kafka_jumbo",instance=~"(kafka-jumbo1001|kafka-jumbo1002|kafka-jumbo1003|kafka-jumbo1004|kafka-jumbo1005|kafka-jumbo1006):.*"})

It's very interesting that changing --query.max-samples had no effect here (see also T222105). My suspicion is that constructing the query this way requires a very expensive traversal of the TSDB to find the relevant timeseries, which would explain why it wasn't impacted by max-samples. As for why it isn't affected by query.timeout, there must be a Prometheus bug at this particular stage of query execution, where timeouts aren't checked (or the query context isn't made available). It's possible T222113 would help with this.
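
For context, these are the two server-side limits in question; the suspicion above is that the expensive series lookup happens before either limit is enforced. The values below are the Prometheus defaults and the config path is just a placeholder, not necessarily what this server runs with:

```
# --query.timeout      per-query wall-clock limit (default 2m)
# --query.max-samples  max samples a single query may load into memory (default 50000000)
prometheus --config.file=/etc/prometheus/prometheus.yml \
  --query.timeout=2m --query.max-samples=50000000
```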

For now I think this dashboard needs to have a bunch of these panels removed, or changed to use recording rules. @Ottomata @elukey do you think you could do this soon?

Dzahn triaged this task as High priority. Apr 30 2019, 9:37 PM

Hm, I just edited some of those graphs so that they don't use regex '=~' matching for $kafka_broker, or, if they do, I removed the :.* part, which was unneeded. (Not sure if removing the .* will actually make the regexes any more performant.)

I'd like to make some Rows like we do on some other dashboards, and have them collapsed by default. However, I can't seem to use the Grafana UI to add a Row in the proper place on the dashboard, or to move one later! They always end up on top, with some weird selection of graphs added. I must be missing something in the UI to do this properly...

Ah, Chris clued me in, I have to collapse the Row in order to move it.

I've modified the Kafka dashboard so that only the Summary Row is uncollapsed by default. I've also changed the default time range to the last 3 hours, rather than the last 24.

If necessary, we can remove some of the graph panels that are also already in the node/cluster dashboards, but I'd rather keep them here if I can. The node/cluster ones have the information, but IMO are a bit harder to read because of the way they are organized. It's also nice to be able to see the behavior of Kafka in the same dashboard as CPU/memory, disk IO, etc.

Thanks, that's a start. Can you also disable auto-refresh?
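
For reference, both of these live in the dashboard's JSON model as the "refresh" and "time" fields (standard Grafana schema); the snippet below just illustrates the two settings being discussed, it is not copied from the actual dashboard:

```
{
  "refresh": false,
  "time": { "from": "now-3h", "to": "now" }
}
```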

Can you rewrite the panels here to use the same precomputed rules that are in the cluster dashboards? e.g. instance_mode:cpu:rate5m{instance=~"$instance:.*",mode!="idle"} for CPU usage rate. That should make them a lot faster to load and much less impactful on the Prometheus server.
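
To make the difference concrete, here is a sketch of the before/after for the CPU panel. The "before" line mirrors the dashboard's current raw-metric pattern; on the recording-rule side, summing the non-idle modes per instance is an assumption about how the panel would chart it, not the exact query from the cluster dashboards:

```
# Before: selects and irate()s every matching raw node_cpu series at render time.
avg by (instance) (irate(node_cpu{cluster="$cluster",mode!="idle",instance=~"$kafka_broker:.*"}[5m]))

# After: reads only the small, already-aggregated series from the recording rule.
sum by (instance) (instance_mode:cpu:rate5m{mode!="idle",instance=~"$kafka_broker:.*"})
```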

@CDanis please feel free, if you know exactly what needs to be changed, to modify all the necessary panels in the Kafka dashboard. These graphs started as something that only Analytics used, but they are now common for everybody, so we (Analytics) have no issue with other folks modifying the dashboard. I don't want to dodge work, of course; I just want to let anybody fix this issue quickly rather than waiting on back-and-forth chats in this task :)

I tried to set instance_mode:cpu:rate5m{instance=~"$kafka_broker:.*",mode!="idle"} for the cpu usage panel, but I ended up loading all the hosts. The main problem (I think) is that the custom All value for the $kafka_broker variable is .*, which is probably not the right one. What would be a good value?

I think you should just be able to remove the "custom all value" in the dashboard settings and have it work. In this case Grafana will create its own 'all' value that is simply a regex OR'ing together all the known values, which it looks like it computes based on the cluster=kafka_jumbo hidden variable.
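
For concreteness, here is roughly what the two "All" behaviours expand to in a panel's instance matcher (illustrative; the exact expansion depends on the Grafana version):

```
# Custom all value ".*": matches everything, which is how the panels ended up loading every host.
instance=~".*"

# Grafana-generated all value: just the variable's known values OR'd together.
instance=~"kafka-jumbo1001|kafka-jumbo1002|kafka-jumbo1003|kafka-jumbo1004|kafka-jumbo1005|kafka-jumbo1006"
```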

Also, sorry, I don't have a lot of time left over this week; I can take a deeper look next week.

From a quick test it seems that the default "all" option leads to $kafka_broker being only kafka-jumbo1006 (that is, the last one listed).

Note to self: remember that doing the above breaks all the kafka graphs

It seems that the following happens when using the default all value (using the current cpu usage query, since it runs into the same problem):

avg by (instance) (irate(node_cpu{cluster="$cluster",mode!="idle",instance=~"kafka-jumbo1001|kafka-jumbo1002|kafka-jumbo1003|kafka-jumbo1004|kafka-jumbo1005|kafka-jumbo1006.*"}[5m]))

With the regex expanded this way, only the last host shows up; no matter which kafka-jumbo entries I remove from the instance regex, I only ever get the last one. Adding round brackets solves the problem, but apparently Grafana is not adding them (I checked using the query inspector).
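
In other words, it looks like regex precedence plus anchoring: | binds more loosely than the trailing .*, and Prometheus anchors label-matcher regexes, so (assuming the instance values carry a port suffix, as the :.* earlier in this task suggests) the bare hostnames never match:

```
# Trailing .* attaches only to the last alternative, so only kafka-jumbo1006:<port> matches:
instance=~"kafka-jumbo1001|kafka-jumbo1002|kafka-jumbo1003|kafka-jumbo1004|kafka-jumbo1005|kafka-jumbo1006.*"

# With grouping, the .* applies to every alternative and all six brokers match:
instance=~"(kafka-jumbo1001|kafka-jumbo1002|kafka-jumbo1003|kafka-jumbo1004|kafka-jumbo1005|kafka-jumbo1006).*"
```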

This works with the default all value:

avg by (instance) (irate(node_cpu{cluster="$cluster",mode!="idle",instance=~"($kafka_broker).*"}[5m]))

Swapped all the occurrences of instance=~"$kafka_broker" with instance=~"($kafka_broker).*", and the dashboard seems to load faster now. Also removed the .* custom value from the $kafka_broker All value field.

CDanis claimed this task.

It does seem much faster now, thanks @elukey! The impact of loading 30 days on Prometheus is also minimal now: modest CPU usage, and while there was some increase in RAM consumption over baseline while we were both playing with this, it's not concerning. Thank you :)