
thanos-query overload due to heavy queries
Closed, Resolved · Public

Description

Today thanos-query in eqiad paged due to overload. On titan1001 the service recovered itself, however on titan1002 load stayed up until I restarted thanos-query manually.

Unsurprisingly, the root cause is some heavy queries. Note that the queries below are the ones in flight at the time and for which we returned non-200; not all queries listed here are necessarily problematic!

2025-02-05T10:52:58 119 days, 0:00:00 'sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[1d]) * 60)' (replied 200 to Grafana/9.5.18)
2025-02-05T10:54:36 90 days, 0:00:00 'sum by(wiki, platform, type) (increase(mediawiki_WikimediaEvents_temp_account_creation_throttled_total{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)", platform=~"(android|commons|desktop|ios|linux/amd64|mobile|unknown|web)"}[1h]))' (replied 503 to Grafana/9.5.18)
2025-02-05T10:54:37 90 days, 0:00:00 'sum by(wiki) (mediawiki_WikimediaEvents_local_temporary_account_ip_viewers_total{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)"})' (replied 503 to Grafana/9.5.18)
2025-02-05T10:54:38 90 days, 0:00:00 '(sum(mediawiki_WikimediaEvents_local_temporary_account_ip_viewers_with_enabled_preference_total{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)"}) by (wiki)) / ( (sum(mediawiki_WikimediaEvents_local_temporary_account_ip_viewers_total{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)"}) by (wiki)) - (sum(mediawiki_WikimediaEvents_locally_auto_enrolled_temporary_account_ip_viewers_total{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)"}) by (wiki) ))' (replied 503 to Grafana/9.5.18)
2025-02-05T10:54:38 90 days, 0:00:00 'mediawiki_WikimediaEvents_global_temporary_account_ip_viewers_total' (replied 503 to Grafana/9.5.18)
2025-02-05T10:54:39 90 days, 0:00:00 'mediawiki_WikimediaEvents_global_temporary_account_ip_viewers_with_enabled_preference_total' (replied 503 to Grafana/9.5.18)
2025-02-05T10:54:39 90 days, 0:00:00 'sum by(wiki, user) (increase(mediawiki_WikimediaEvents_block_target_total{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)", user=~"(anon|iprange|normal|temp)"}[1h]))' (replied 503 to Grafana/9.5.18)
2025-02-05T10:54:40 90 days, 0:00:00 'mediawiki_WikimediaEvents_global_temporary_account_ip_viewers_total' (replied 503 to Grafana/9.5.18)
2025-02-05T10:54:40 90 days, 0:00:00 'sum by(user) (increase(mediawiki_WikimediaEvents_global_block_target_total{user=~"(anon|iprange|normal|temp)"}[1h]))' (replied 503 to Grafana/9.5.18)
2025-02-05T10:53:26 119 days, 0:00:00 'max_over_time(sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[2m] offset -7d) * 60)[1w:2m])' (replied 502 to Grafana/9.5.18)
2025-02-05T10:53:56 119 days, 0:00:00 'min_over_time(sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[2m] offset -7d) * 60)[1w:2m])' (replied 502 to Grafana/9.5.18)
2025-02-05T10:54:04 1 day, 0:00:00 '(sum(increase(webperf_navigationtiming_responsestart_seconds_bucket{is_oversample="False", mw_context=~"anonymous_mainspace_view", mw_skin=~"minerva", le="0.8"}[24h]))/sum(increase(webperf_navigationtiming_responsestart_seconds_bucket{is_oversample="False", mw_context=~"anonymous_mainspace_view", mw_skin=~"minerva", le="+Inf"}[24h]))) - (sum(increase(webperf_navigationtiming_responsestart_seconds_bucket{is_oversample="False", mw_context=~"anonymous_mainspace_view", mw_skin=~"minerva", le="0.8"}[24h] offset 7d))/sum(increase(webperf_navigationtiming_responsestart_seconds_bucket{is_oversample="False", mw_context=~"anonymous_mainspace_view", mw_skin=~"minerva", le="+Inf"}[24h] offset 7d)))' (replied 502 to Grafana/9.5.18)
2025-02-05T10:54:05 0:10:00 '100 * sum(rate(varnish_resourceloader_resp{x_cache=~"(hit|int).*"}[1h])) / sum(rate(varnish_resourceloader_resp[1h]))' (replied 502 to Grafana/9.5.18)
2025-02-05T10:54:06 90 days, 0:00:00 'sum by(user) (increase(mediawiki_WikimediaEvents_editResponseTime_seconds_sum{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)", user=~"(anon|iprange|normal|temp)"}[1h]))' (replied 502 to Grafana/9.5.18)
2025-02-05T10:54:06 90 days, 0:00:00 'sum by(wiki, user) (increase(mediawiki_WikimediaEvents_editResponseTime_seconds_sum{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)", user=~"(anon|iprange|normal|temp)"}[1h]))' (replied 502 to Grafana/9.5.18)
2025-02-05T10:54:06 90 days, 0:00:00 'sum by(user, platform) (increase(mediawiki_WikimediaEvents_editResponseTime_seconds_sum{wiki=~"(cswikiversity|dawiki|fawiktionary|igwiki|itwikiquote|jawikibooks|nowiki|rowiki|scwiki|shwiki|srwiki|zh_yuewiki)", user=~"(anon|iprange|normal|temp)", platform=~"(android|commons|desktop|ios|linux/amd64|mobile|unknown|web)"}[1h]))' (replied 502 to Grafana/9.5.18)
2025-02-05T10:54:06 0:05:00 'histogram_quantile(0.5, sum by (le) (rate(cpjobqueue_normal_rule_processing_delay_bucket{rule=~".*DispatchChanges$",rule!~".*-partitioner-mediawiki-job-.*", service="cpjobqueue"}[15m])))' (replied 502 to Grafana/9.5.18)

Event Timeline

As a first mitigation I changed the obvious offenders to use the mw edit rate recording rule (T383963).

@fgiunchedi thank you for dealing with this. Is there anything we should change as dashboard / metric owners here?

@mszabo great question and thank you! The most immediate change is to introduce recording rules for big metrics (or for whole big queries, though in my experience that is rarer). For example, for the edit count dashboard I replaced

sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_sum[2m])) * 60

with

sum(entry_kubernetes_namespace:mediawiki_WikimediaEvents_editResponseTime_seconds_count:rate2m) * 60

and similarly in other panels. Going forward we (o11y) will be auditing Prometheus queries in dashboards to proactively identify other similar cases. I'm also sure the edit rate above is not the only offender!
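For context, a recording rule backing that replacement query could look roughly like the sketch below (the rule group name and file layout are hypothetical, the aggregation labels are inferred from the rule name, and the actual rule lives in operations/puppet):

groups:
  - name: mediawiki_wikimediaevents
    rules:
      # Pre-aggregate the heavy raw metric once at rule-evaluation time so
      # dashboards query a single small recorded series instead of rate()-ing
      # every underlying series on each panel refresh.
      - record: entry_kubernetes_namespace:mediawiki_WikimediaEvents_editResponseTime_seconds_count:rate2m
        expr: sum by (entry, kubernetes_namespace) (rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[2m]))

The dashboard-side sum() over the recorded series then stays cheap because most of the work has already been done by Prometheus.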

@fgiunchedi Would it be useful to convert some of these labels into dedicated metrics? Having per-wiki metrics would probably be a step too far, but having a metric per account type seems more reasonable.

I've changed the default date range for the dashboard to 6h for now while we determine next steps.

I suggest disabling the dashboard until it's fixed. This is impacting all of Grafana.

> @fgiunchedi Would it be useful to convert some of these labels into dedicated metrics? Having per-wiki metrics would probably be a step too far, but having a metric per account type seems more reasonable.

Yes, that can be a mitigation depending on cardinality, at the cost of being harder for users to query; though I'd wait and see how we go with recording rules.

I've clarified in the task description that not all queries/metrics posted are necessarily problematic; those are some of the queries we returned errors for. Apologies for the confusion.

Today's problem was similar yet different; I suspect envoy metrics are involved, since those tend to be quite big. For example:

2025-02-06T15:24:15 1 day, 3:30:00 'histogram_quantile(0.9, sum(rate(envoy_http_downstream_rq_time_bucket{site="eqiad", prometheus="k8s", app="mediawiki",kubernetes_namespace=~".*", envoy_http_conn_manager_prefix!~"admin|admin_interface"}[2m])) by (le))' (replied 503 to Grafana/9.5.18)
2025-02-06T15:24:06 1 day, 3:30:00 'envoy_cluster_upstream_rq{site="eqiad", prometheus="k8s", app=~"(apertium|api-gateway|aqs-http-gateway|chart-renderer|chromium-render|citoid|cxserver|developer-portal|eventgate|eventstreams|flink-app-commons|flink-app-consumer-cloudelastic|flink-app-consumer-search|flink-app-main|flink-app-producer|flink-app-wikidata|function-evaluator|function-orchestrator|ipoid|istio-ingressgateway|kartotherian|kask|linkrecommendation|machinetranslation|mathoid|mediawiki|miscweb|mobileapps|push-notifications|recommendation-api|shellbox|tegola-vector-tiles|termbox|toolhub|wikifeeds|zotero)", kubernetes_namespace=~".*"}' (replied 200 to Grafana/9.5.18)

There's also an open question in my mind as to why thanos-query consumes quite a bit of memory yet the OOM killer doesn't kick in. Unfortunately I don't have the exact output anymore, though systemctl status thanos.slice showed only a few kB of memory available, as if to say that memory grew up to the limit and no further.

Change #1122589 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] prometheus: add 90pct to envoy recording rules

https://gerrit.wikimedia.org/r/1122589

Change #1122589 merged by Filippo Giunchedi:

[operations/puppet@production] prometheus: add 90pct to envoy recording rules

https://gerrit.wikimedia.org/r/1122589
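For reference, the gist of such a rule, sketched from the patch subject and mirroring the naming of the existing 50pct rule, would be something like the following (the exact rule name, labels, and file in operations/puppet may differ; see the Gerrit change above for the real thing):

groups:
  - name: envoy
    rules:
      # Hypothetical sketch: record the 90th percentile upstream request time
      # per app/cluster/namespace so dashboards no longer have to aggregate the
      # raw envoy_cluster_upstream_rq_time_bucket histogram themselves.
      - record: app_envoy_cluster_namespace:envoy_cluster_upstream_rq_time:90pct2m
        expr: |
          histogram_quantile(0.90,
            sum by (app, envoy_cluster_name, kubernetes_namespace, le)
              (rate(envoy_cluster_upstream_rq_time_bucket[2m])))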

I have taken another look at the dashboards using big envoy metrics, such as https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s. We will have to do a deeper audit of the metrics, though in the meantime I have tested changing at least the queries used to load the dashboard variables over to the recording rules.

From Grafana's perspective the diff looks like this:

  • changed definition label_values(envoy_cluster_upstream_rq{site="$site", prometheus="$prometheus", kubernetes_namespace=~"$kubernetes_namespace"},app) to label_values(app_envoy_cluster_namespace:envoy_cluster_upstream_rq_time:50pct2m{site="$site", prometheus="$prometheus", kubernetes_namespace=~"$kubernetes_namespace"},app)
  • changed query label_values(envoy_cluster_upstream_rq{site="$site", prometheus="$prometheus", kubernetes_namespace=~"$kubernetes_namespace"},app) to label_values(app_envoy_cluster_namespace:envoy_cluster_upstream_rq_time:50pct2m{site="$site", prometheus="$prometheus", kubernetes_namespace=~"$kubernetes_namespace"},app)
  • changed definition label_values(envoy_cluster_upstream_rq{site="$site", prometheus="$prometheus", app=~"$app", kubernetes_namespace=~"$kubernetes_namespace"},envoy_cluster_name) to label_values(app_envoy_cluster_namespace:envoy_cluster_upstream_rq_time:50pct2m{site="$site", prometheus="$prometheus", app=~"$app", kubernetes_namespace=~"$kubernetes_namespace"},envoy_cluster_name)
  • changed query label_values(envoy_cluster_upstream_rq{site="$site", prometheus="$prometheus", app=~"$app", kubernetes_namespace=~"$kubernetes_namespace"},envoy_cluster_name) to label_values(app_envoy_cluster_namespace:envoy_cluster_upstream_rq_time:50pct2m{site="$site", prometheus="$prometheus", app=~"$app", kubernetes_namespace=~"$kubernetes_namespace"},envoy_cluster_name)
fgiunchedi claimed this task.

I'm confident enough with the new queries and the limits now being enforced to resolve this.