
thanos-query probedown due to OOM of both eqiad titan frontends
Open, Needs Triage, Public

Description

We got paged today:

https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All
thanos-query:443 failed when probed by http_thanos-query_ip4 from eqiad. Availability is 0%.
https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_thanos-query_ip4)))))
https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443
Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page
Alerts Firing:
Labels:
 - alertname = ProbeDown
 - address = 10.2.2.53
 - family = ip4
 - instance = thanos-query:443
 - job = probes/service
 - module = http_thanos-query_ip4
 - prometheus = ops
 - severity = page
 - site = eqiad
 - source = prometheus
 - team = sre
Annotations:
 - dashboard = https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All
 - description = thanos-query:443 failed when probed by http_thanos-query_ip4 from eqiad. Availability is 0%.
 - logs = https://logstash.wikimedia.org/app/dashboards#/view/f3e709c0-a5f8-11ec-bf8e-43f1807d5bc2?_g=(filters:!((query:(match_phrase:(service.name:http_thanos-query_ip4)))))
 - runbook = https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443
 - summary = Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page
Source: https://prometheus-eqiad.wikimedia.org/ops/graph?g0.expr=%28avg_over_time%28probe_success%7Bjob%3D~%22probes%2F.%2A%22%2Cmodule%3D~%22%28http%7Ctcp%29.%2A%22%7D%5B1m%5D%29+and+on+%28instance%29+service_catalog_page+%3D%3D+1%29+%2A+100+%3C+10&g0.tab=1

On both eqiad titan nodes, the OOM killer came for thanos:

2024-02-06T16:02:48.599094+00:00 titan1002 kernel: [529285.100137] Out of memory: Killed process 944 (thanos) total-vm:25141640kB, anon-rss:20740672kB, file-rss:0kB, shmem-rss:0kB, UID:111 pgtables:47124kB oom_score_adj:0
2024-02-06T16:02:42.108834+00:00 titan1001 kernel: [534723.800913] Out of memory: Killed process 935 (thanos) total-vm:25080384kB, anon-rss:24252060kB, file-rss:0kB, shmem-rss:0kB, UID:111 pgtables:47820kB oom_score_adj:0

Systemd notices and restarts the service, so the incident self-resolved:

2024-02-06T16:02:48.599092+00:00 titan1002 kernel: [529285.100104] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/thanos-query.service,task=thanos,pid=944,uid=111
2024-02-06T16:02:48.790450+00:00 titan1002 systemd[1]: thanos-query.service: A process of this unit has been killed by the OOM killer.
2024-02-06T16:02:49.814773+00:00 titan1002 systemd[1]: thanos-query.service: Main process exited, code=killed, status=9/KILL
2024-02-06T16:02:49.814901+00:00 titan1002 systemd[1]: thanos-query.service: Failed with result 'oom-kill'.
2024-02-06T16:02:49.815302+00:00 titan1002 systemd[1]: thanos-query.service: Consumed 11h 54min 16.327s CPU time.
2024-02-06T16:02:49.916665+00:00 titan1002 systemd[1]: thanos-query.service: Scheduled restart job, restart counter is at 1.
2024-02-06T16:02:49.917068+00:00 titan1002 systemd[1]: Stopped thanos-query.service - Thanos query.
2024-02-06T16:02:49.917829+00:00 titan1002 systemd[1]: thanos-query.service: Consumed 11h 54min 16.327s CPU time.
2024-02-06T16:02:49.963018+00:00 titan1002 systemd[1]: Started thanos-query.service - Thanos query.

My best guess is that a particular query caused the memory overuse (which would also explain the almost-identical timings?), but that warrants further investigation. It would be nice if user error didn't kill the service, however briefly. And obviously, if that happens frequently enough, systemd will eventually decide the service has failed and stop trying to restart it.
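
For reference, the systemd knobs involved in that failure mode are the restart and start-limit settings; a minimal sketch of a drop-in, with illustrative values rather than whatever the unit actually ships with:

  # /etc/systemd/system/thanos-query.service.d/restart.conf (hypothetical drop-in)
  [Unit]
  # mark the unit failed and stop retrying after 5 failures within 10 minutes
  StartLimitIntervalSec=600
  StartLimitBurst=5

  [Service]
  Restart=always
  RestartSec=5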

From an oncall perspective, too, it'd be nice if the Thanos docs spelled out which things are served by titan* rather than thanos-fe* (and maybe the alert could be a bit clearer in this regard too).

Event Timeline

A potentially similar issue (request/traffic-related high load) happened around the 28th of September, when I added this TODO: https://wikitech.wikimedia.org/w/index.php?title=Thanos&diff=prev&oldid=2115837. Back then I was confused about what this service even was.

Thank you for the report and investigation. I took this chance to update https://wikitech.wikimedia.org/wiki/Thanos and make the titan/thanos-fe split clearer. I've also improved the alerts section to mention the thanos-query OOM condition.

On the specific OOM issue, we are using a big and effective hammer (i.e. cgroup memory limits). On the Thanos side there are a few things to try as experiments, e.g. https://thanos.io/tip/components/query.md/#experimental-promql-engine, though my understanding is that the "query of death" scenario is hard to mitigate in all cases.
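
For context, the "hammer" here means systemd's cgroup memory controls on the unit; a minimal sketch of such a drop-in, with made-up values rather than what puppet actually deploys:

  # /etc/systemd/system/thanos-query.service.d/memory.conf (sketch, illustrative values)
  [Service]
  MemoryAccounting=yes
  # soft threshold: the kernel starts reclaiming/throttling above this
  MemoryHigh=16G
  # hard cap: the cgroup OOM killer takes out thanos-query before it can starve the whole host
  MemoryMax=20G

If I read the docs right, the experimental engine linked above is selected with thanos query's --query.promql-engine flag.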

Thanks for the report; we'll continue to investigate and discuss.

We had a repeat at almost exactly the same time today, only this time neither node recovered and both needed power-cycling.
VO incident 4427.

Per docs, Thanos supports logging when a query is received but before it begins execution:

Log details prior to the request and when the request completes - Pre request log is made at debug level.

I don't relish the idea of turning on debug-level logging in production but I think we may have to.
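
For the record, per the same docs that request logging is configured through --request.logging-config / --request.logging-config-file; something along these lines should enable the pre-request debug log (a sketch, not necessarily what we'd ship):

  # request-logging.yaml, passed to thanos query via --request.logging-config-file (sketch)
  http:
    options:
      level: DEBUG
      decision:
        log_start: true   # log each request as it is received, at debug level
        log_end: true     # log again when the request completes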

Turn it on at 15:55 UTC? </only-half-joking>

Here's a capture of what was logged by Apache in the minute leading up to titan1002 hanging: https://phabricator.wikimedia.org/P56453

Change 998773 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: tighten memory limits for query/query-frontend

https://gerrit.wikimedia.org/r/998773

Current avenues I'm exploring:

  • Tighten the memory limits: thanos-query memory utilization jumps up very fast, and I suspect that in certain cases there isn't enough memory left for the host to remain usable. That is of course a worse scenario than thanos-query simply being restarted, and it takes longer to recover from.
  • Add debug logging to thanos-query as @CDanis pointed out

Change 998786 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: enable request debug for query / query-frontend

https://gerrit.wikimedia.org/r/998786

Change 998773 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: tighten memory limits for query/query-frontend

https://gerrit.wikimedia.org/r/998773

Change 998786 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: enable request debug for query / query-frontend

https://gerrit.wikimedia.org/r/998786

Both of these (tighter memory limits and request debug logging) are now in place, which should help with both faster recoveries and debugging.

This happened again, albeit on titan1001 only, with query-frontend and store both using CPU and memory and the host becoming unresponsive.

[Screenshot: 2024-02-14-114747_1962x1808_scrot.png]

I'll be digging into the logs to see if a smoking gun shows up. Also, all Thanos components should live in their own systemd slice, with the idea of limiting their resources as a whole.

Change 1003439 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: run Thanos components in a systemd slice

https://gerrit.wikimedia.org/r/1003439

Change 1003439 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: run Thanos components in a systemd slice

https://gerrit.wikimedia.org/r/1003439

Now Thanos services run in their own slice, which should help with enforcing resource limits.
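
For the record, the general shape of that setup is roughly the following (unit names and the memory value are illustrative, not a copy of the puppet change):

  # /etc/systemd/system/thanos.slice (sketch)
  [Unit]
  Description=Thanos components

  [Slice]
  MemoryAccounting=yes
  # aggregate cap across all Thanos services placed in this slice
  MemoryMax=48G

  # and in each thanos-*.service unit:
  [Service]
  Slice=thanos.slice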

WRT smoking guns, I couldn't find any obvious ones, though I noticed thanos-query-frontend doesn't cache results on some occasions:

2024-02-14T09:40:16.050875+00:00 titan1001 thanos-query-frontend[256524]: level=debug ts=2024-02-14T09:40:16.037295765Z caller=results_cache.go:278 msg="Cache-Control [...] [1706607605704,9223372036854775807]"

I'm assuming those are requests from thanos.w.o, which go through the CDN, as opposed to Grafana, where requests originate from the Grafana hosts themselves.

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:05:38Z] <claime> restarting thanos-query.service - titan1001 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:09:59Z] <claime> restarting thanos-query.service - titan1002 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:22:03Z] <claime> depooling thanos-query eqiad - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:22:46Z] <claime> restarting thanos-query.service - titan1002 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:26:12Z] <claime> restarting thanos-query.service - titan1001 - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:33:02Z] <claime> repooling thanos-query eqiad - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:48:17Z] <godog> bounce thanos-query on titan2* - T356788

Mentioned in SAL (#wikimedia-operations) [2024-02-19T10:54:40Z] <godog> bounce thanos-query on titan1* - T356788

Change 1004630 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: set only MemoryMax limit

https://gerrit.wikimedia.org/r/1004630

Change 1004630 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: set only MemoryMax limit

https://gerrit.wikimedia.org/r/1004630

This happened again today; recovery was better in the sense that the titan hosts themselves remained available, the OOM killer kicked in, and things recovered without intervention. The page still went out as probes failed, though.

Big memory users were thanos-query-frontend and thanos-store, suggesting this was a query spanning a long time range.

[Screenshot: 2024-02-23-115621_2338x1763_scrot.png]

Change 1006000 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: ship tool to analyze query apache access logs

https://gerrit.wikimedia.org/r/1006000

Change 1006006 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: limit query length in frontend

https://gerrit.wikimedia.org/r/1006006

Change 1006006 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: limit query length in frontend

https://gerrit.wikimedia.org/r/1006006
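
For reference, on the query-frontend side that kind of limit is a command-line flag; roughly as below, with an illustrative value rather than whatever the change actually sets:

  # thanos query-frontend invocation (sketch; downstream URL and limit value are guesses)
  thanos query-frontend \
    --query-frontend.downstream-url=http://localhost:10902 \
    --query-range.max-query-length=30d   # reject range queries spanning more than 30 days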

Change 1006000 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: ship tool to analyze query apache access logs

https://gerrit.wikimedia.org/r/1006000
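
To give an idea of the kind of analysis this enables, here is an illustrative shell sketch (not the tool the change actually ships; the log path and the assumption that start/end arrive as unix-timestamp query parameters are mine):

  # sketch: print the time span (in days) of /api/v1/query_range requests covering more than ~30 days
  grep 'query_range' /var/log/apache2/other_vhosts_access.log \
    | grep -oP 'start=\K[0-9]+|end=\K[0-9]+' \
    | paste - - \
    | awk '($2 - $1) > 30*86400 {printf "%.1f days\n", ($2 - $1) / 86400}'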

This may be worth taking a look at as well: T358634

Change 1007332 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: revert to standard logging level

https://gerrit.wikimedia.org/r/1007332

Change 1007332 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: revert to standard logging level

https://gerrit.wikimedia.org/r/1007332

Change #1018263 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: parse more queries from thanos access logs

https://gerrit.wikimedia.org/r/1018263

Change #1018263 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: parse more queries from thanos access logs

https://gerrit.wikimedia.org/r/1018263