
thanos-query probedown due to OOM of both eqiad titan frontends
Open, Needs Triage, Public

Description

We got paged today:

https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All
thanos-query:443 failed when probed by http_thanos-query_ip4 from eqiad. Availability is 0%.
https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_thanos-query_ip4)))))
https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443
Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page
Alerts Firing:
Labels:
 - alertname = ProbeDown
 - address = 10.2.2.53
 - family = ip4
 - instance = thanos-query:443
 - job = probes/service
 - module = http_thanos-query_ip4
 - prometheus = ops
 - severity = page
 - site = eqiad
 - source = prometheus
 - team = sre
Annotations:
 - dashboard = https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All
 - description = thanos-query:443 failed when probed by http_thanos-query_ip4 from eqiad. Availability is 0%.
 - logs = https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_thanos-query_ip4)))))
 - runbook = https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443
 - summary = Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page
Source: https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=%28avg_over_time%28probe_success%7Bjob%3D~%22probes%2F.%2A%22%2Cmodule%3D~%22%28http%7Ctcp%29.%2A%22%7D%5B1m%5D%29+and+on+%28instance%29+service_catalog_page+%3D%3D+1%29+%2A+100+%3C+10&g0.tab=1

On both eqiad titan nodes, the OOM killer came for thanos:

2024-02-06T16:02:48.599094+00:00 titan1002 kernel: [529285.100137] Out of memory: Killed process 944 (thanos) total-vm:25141640kB, anon-rss:20740672kB, file-rss:0kB, shmem-rss:0kB, UID:111 pgtables:47124kB oom_score_adj:0
2024-02-06T16:02:42.108834+00:00 titan1001 kernel: [534723.800913] Out of memory: Killed process 935 (thanos) total-vm:25080384kB, anon-rss:24252060kB, file-rss:0kB, shmem-rss:0kB, UID:111 pgtables:47820kB oom_score_adj:0

Systemd notices and restarts the service, so the incident self-resolved:

2024-02-06T16:02:48.599092+00:00 titan1002 kernel: [529285.100104] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/thanos-query.service,task=thanos,pid=944,uid=111
2024-02-06T16:02:48.790450+00:00 titan1002 systemd[1]: thanos-query.service: A process of this unit has been killed by the OOM killer.
2024-02-06T16:02:49.814773+00:00 titan1002 systemd[1]: thanos-query.service: Main process exited, code=killed, status=9/KILL
2024-02-06T16:02:49.814901+00:00 titan1002 systemd[1]: thanos-query.service: Failed with result 'oom-kill'.
2024-02-06T16:02:49.815302+00:00 titan1002 systemd[1]: thanos-query.service: Consumed 11h 54min 16.327s CPU time.
2024-02-06T16:02:49.916665+00:00 titan1002 systemd[1]: thanos-query.service: Scheduled restart job, restart counter is at 1.
2024-02-06T16:02:49.917068+00:00 titan1002 systemd[1]: Stopped thanos-query.service - Thanos query.
2024-02-06T16:02:49.917829+00:00 titan1002 systemd[1]: thanos-query.service: Consumed 11h 54min 16.327s CPU time.
2024-02-06T16:02:49.963018+00:00 titan1002 systemd[1]: Started thanos-query.service - Thanos query.

My best guess is that a particular query caused the memory overuse (which would also explain the almost-identical timings?), but that warrants further investigation. It would be nice if user error didn't kill the service, however briefly. And obviously, if that happens frequently enough, systemd will eventually decide the service has failed and stop trying to restart it.
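
For reference, the systemd knobs involved in that failure mode are the restart and start-limit settings; a minimal sketch of a drop-in, with illustrative values rather than whatever the unit actually ships with:

  # /etc/systemd/system/thanos-query.service.d/restart.conf (hypothetical drop-in)
  [Unit]
  # mark the unit failed and stop retrying after 5 failures within 10 minutes
  StartLimitIntervalSec=600
  StartLimitBurst=5

  [Service]
  Restart=always
  RestartSec=5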

From an oncall perspective, too, it'd be nice if the Thanos docs spelled out which things are served by titan* rather than thanos-fe* (and maybe the alert could be a bit clearer in this regard too).

Event Timeline

A potentially similar issue (request/traffic-related high load) happened around the 28th of September, when I added this TODO: https://wikitech.wikimedia.org/w/index.php?title=Thanos&diff=prev&oldid=2115837. Back then I was confused about what this service even was.

Thank you for the report and investigation. I took this chance to update https://wikitech.wikimedia.org/wiki/Thanos and make the titan/thanos-fe split clearer. I've also improved the alerts section to mention the thanos-query OOM condition.

On the specific OOM issue, we are using a big and effective hammer (i.e. cgroup memory limits). On the Thanos side there are a few things to try as experiments, e.g. https://thanos.io/tip/components/query.md/#experimental-promql-engine, though my understanding is that the "query of death" scenario is hard to mitigate in all cases.
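
For context, the "hammer" here means systemd's cgroup memory controls on the unit; a minimal sketch of such a drop-in, with made-up values rather than what puppet actually deploys:

  # /etc/systemd/system/thanos-query.service.d/memory.conf (sketch, illustrative values)
  [Service]
  MemoryAccounting=yes
  # soft threshold: the kernel starts reclaiming/throttling above this
  MemoryHigh=16G
  # hard cap: the cgroup OOM killer takes out thanos-query before it can starve the whole host
  MemoryMax=20G

If I read the docs right, the experimental engine linked above is selected with thanos query's --query.promql-engine flag.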

Thanks for the report; we'll continue to investigate and discuss.

We had a repeat at almost exactly the same time today, only this time neither node recovered and both needed power-cycling.
VO incident 4427.

Per docs, Thanos supports logging when a query is received but before it begins execution:

Log details prior to the request and when the request completes - Pre request log is made at debug level.

I don't relish the idea of turning on debug-level logging in production but I think we may have to.
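
For the record, per the same docs that request logging is configured through --request.logging-config / --request.logging-config-file; something along these lines should enable the pre-request debug log (a sketch, not necessarily what we'd ship):

  # request-logging.yaml, passed to thanos query via --request.logging-config-file (sketch)
  http:
    options:
      level: DEBUG
      decision:
        log_start: true   # log each request as it is received, at debug level
        log_end: true     # log again when the request completes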

Turn it on at 15:55 UTC? </only-half-joking>

Here's a capture of what was logged by Apache in the minute leading up to titan1002 hanging: https://phabricator.wikimedia.org/P56453

Change 998773 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: tighten memory limits for query/query-frontend

https://gerrit.wikimedia.org/r/998773

Current avenues I'm exploring:

  • Tighten the memory limits: thanos-query memory utilization jumps up very fast, and I suspect that in certain cases there isn't enough memory left for the host to remain usable. That is of course a worse scenario than thanos-query simply being restarted, and it takes longer to recover from.
  • Add debug logging to thanos-query as @CDanis pointed out

Change 998786 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: enable request debug for query / query-frontend

https://gerrit.wikimedia.org/r/998786

Change 998773 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: tighten memory limits for query/query-frontend

https://gerrit.wikimedia.org/r/998773

Change 998786 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: enable request debug for query / query-frontend

https://gerrit.wikimedia.org/r/998786

Both of these (tighter memory limits and request debug logging) are now in place, which should help with both faster recoveries and debugging.

This happened again, albeit on titan1001 only, with query-frontend and store both using CPU and memory and the host becoming unresponsive.

[Screenshot: 2024-02-14-114747_1962x1808_scrot.png]

I'll be digging into the logs to see if a smoking gun shows up. Also, all Thanos components should live in their own systemd slice, with the idea of limiting their resources as a whole.

Change 1003439 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: run Thanos components in a systemd slice

https://gerrit.wikimedia.org/r/1003439

Change 1003439 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: run Thanos components in a systemd slice

https://gerrit.wikimedia.org/r/1003439

Now Thanos services run in their own slice, which should help with enforcing resource limits.
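
For the record, the general shape of that setup is roughly the following (unit names and the memory value are illustrative, not a copy of the puppet change):

  # /etc/systemd/system/thanos.slice (sketch)
  [Unit]
  Description=Thanos components

  [Slice]
  MemoryAccounting=yes
  # aggregate cap across all Thanos services placed in this slice
  MemoryMax=48G

  # and in each thanos-*.service unit:
  [Service]
  Slice=thanos.slice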

WRT smoking guns, I couldn't find any obvious ones, though I noticed thanos-query-frontend doesn't cache results on some occasions:

2024-02-14T09:40:16.050875+00:00 titan1001 thanos-query-frontend[256524]: level=debug ts=2024-02-14T09:40:16.037295765Z caller=results_cache.go:278 msg="Cache-Control [...] [1706607605704,9223372036854775807]"

I'm assuming those are requests from thanos.w.o, which go through the CDN, as opposed to Grafana, where requests originate from the Grafana hosts themselves.

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:05:38Z] <claime> restarting thanos-query.service - titan1001 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:09:59Z] <claime> restarting thanos-query.service - titan1002 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:22:03Z] <claime> depooling thanos-query eqiad - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:22:46Z] <claime> restarting thanos-query.service - titan1002 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:26:12Z] <claime> restarting thanos-query.service - titan1001 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:33:02Z] <claime> repooling thanos-query eqiad - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:48:17Z] <godog> bounce thanos-query on titan2* - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:54:40Z] <godog> bounce thanos-query on titan1* - T356788

Change 1004630 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: set only MemoryMax limit

https://gerrit.wikimedia.org/r/1004630

Change 1004630 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: set only MemoryMax limit

https://gerrit.wikimedia.org/r/1004630

This happened again today; recovery was better in the sense that the titan hosts themselves remained available, the OOM killer kicked in, and things recovered without intervention. The page still went out as probes failed, though.

Big memory users were thanos-query-frontend and thanos-store, suggesting this was a query spanning a long time range.

[Screenshot: 2024-02-23-115621_2338x1763_scrot.png]

Change 1006000 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: ship tool to analyze query apache access logs

https://gerrit.wikimedia.org/r/1006000

Change 1006006 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: limit query length in frontend

https://gerrit.wikimedia.org/r/1006006

Change 1006006 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: limit query length in frontend

https://gerrit.wikimedia.org/r/1006006
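
For reference, on the query-frontend side that kind of limit is a command-line flag; roughly as below, with an illustrative value rather than whatever the change actually sets:

  # thanos query-frontend invocation (sketch; downstream URL and limit value are guesses)
  thanos query-frontend \
    --query-frontend.downstream-url=http://localhost:10902 \
    --query-range.max-query-length=30d   # reject range queries spanning more than 30 days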

Change 1006000 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: ship tool to analyze query apache access logs

https://gerrit.wikimedia.org/r/1006000
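
To give an idea of the kind of analysis this enables, here is an illustrative shell sketch (not the tool the change actually ships; the log path and the assumption that start/end arrive as unix-timestamp query parameters are mine):

  # sketch: print the time span (in days) of /api/v1/query_range requests covering more than ~30 days
  grep 'query_range' /var/log/apache2/other_vhosts_access.log \
    | grep -oP 'start=\K[0-9]+|end=\K[0-9]+' \
    | paste - - \
    | awk '($2 - $1) > 30*86400 {printf "%.1f days\n", ($2 - $1) / 86400}'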

This may be worth taking a look at as well: T358634

Change 1007332 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: revert to standard logging level

https://gerrit.wikimedia.org/r/1007332

Change 1007332 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: revert to standard logging level

https://gerrit.wikimedia.org/r/1007332

Change #1018263 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: parse more queries from thanos access logs

https://gerrit.wikimedia.org/r/1018263

Change #1018263 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: parse more queries from thanos access logs

https://gerrit.wikimedia.org/r/1018263