
thanos query/store OOM on titan hosts
Closed, Resolved, Public

Description

Today we had a page about thanos-query being unavailable, specifically ProbeDown for thanos-query service.

It turns out to be a case similar to what we experienced in T356788: thanos-query ProbeDown due to OOM on both eqiad titan frontends, namely a "query of death" consuming most or all resources on the titan hosts.

The significant difference in my mind is that the defenses we've put in place were not sufficient to prevent the service from wedging. One defense was to put all thanos-related units in thanos.slice and give the slice 85% of memory, with the idea that the OOM killer will kill the offending service, which then gets restarted.

For reasons still unclear to me, the OOM killer doesn't seem to have kicked in at the time of the outage on the titan1* hosts (the ones affected).
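
One way to check whether the kernel OOM killer fired inside the slice is to read the cgroup v2 memory.events counters for it. The following is a minimal sketch, not something taken from this task: the cgroup path /sys/fs/cgroup/thanos.slice is an assumption about the host layout, and the helper names are hypothetical. The counters themselves (high, max, oom, oom_kill) are the standard cgroup v2 ones.

```python
from pathlib import Path

# Assumed location of the slice's cgroup v2 directory; adjust to the host layout.
CGROUP_DIR = Path("/sys/fs/cgroup/thanos.slice")


def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file into a dict."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def oom_summary(events):
    """Summarize the counters most relevant to 'did the OOM killer run?'."""
    kills = events.get("oom_kill", 0)   # processes killed by the OOM killer
    hit_max = events.get("max", 0)      # times usage was about to exceed memory.max
    throttled = events.get("high", 0)   # times usage went over memory.high
    return f"oom_kill={kills} max={hit_max} high={throttled}"


if __name__ == "__main__":
    path = CGROUP_DIR / "memory.events"
    if path.exists():
        print(oom_summary(parse_memory_events(path.read_text())))
    else:
        print(f"{path} not found; is the slice active under cgroup v2?")
```

A non-zero high with oom_kill at zero would suggest the slice was being throttled rather than killed, which may be worth ruling out here.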

See the latter spike, which is the one that caused the outage: https://grafana.wikimedia.org/goto/SzIhuxvHg?orgId=1

Event Timeline

Change #1110798 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos-query: write active queries to file

https://gerrit.wikimedia.org/r/1110798

Change #1110798 merged by Filippo Giunchedi:

[operations/puppet@production] thanos-query: write active queries to file

https://gerrit.wikimedia.org/r/1110798

Yesterday thanos-query paged due to excessive memory/CPU usage on titan hosts. I checked the query logs and it was indeed a case of "query of death" consuming a lot of resources again. Before the holidays we changed the caching parameters of thanos components as part of T368953 and T302995.

[Attached screenshot: 2025-01-16-104508_2508x946_scrot.png, 171 KB]

In particular, thanos-store also spikes in memory usage. As a test, I'll revert the caching bucket setting for thanos-store and see if that leads to better handling of query of death scenarios with the resources available.

Change #1111930 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Revert "thanos-store: enable caching bucket"

https://gerrit.wikimedia.org/r/1111930

Change #1111930 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "thanos-store: enable caching bucket"

https://gerrit.wikimedia.org/r/1111930

The change definitely helped. I was able to submit the problematic query sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[1h]) * 60) with increasingly longer time spans, and thanos was able to deal with it without crashing the titan hosts. The query does sometimes time out, though it works as intended when re-submitted.
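
The test described above can be approximated with a small script that builds range queries over increasingly longer time spans against the Prometheus-compatible query_range API that thanos-query serves. This is a sketch under assumptions: the base URL, the list of spans, and the number of points per query are all hypothetical, and the task does not say how the queries were actually submitted.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the task does not say which URL was queried.
BASE_URL = "http://localhost:10902"
QUERY = "sum(rate(mediawiki_WikimediaEvents_editResponseTime_seconds_count[1h]) * 60)"


def range_query_url(base_url, query, end_ts, span_seconds, points=250):
    """Build a query_range URL covering span_seconds up to end_ts."""
    # Keep the number of returned points roughly constant as the span grows.
    step = max(1, span_seconds // points)
    params = {
        "query": query,
        "start": end_ts - span_seconds,
        "end": end_ts,
        "step": step,
    }
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def escalating_urls(base_url, query, end_ts, spans_hours=(1, 6, 24, 168)):
    """Return one URL per span, shortest first (spans are illustrative)."""
    return [range_query_url(base_url, query, end_ts, h * 3600) for h in spans_hours]


if __name__ == "__main__":
    import time

    for url in escalating_urls(BASE_URL, QUERY, int(time.time())):
        print(url)
```

Fetching each URL in turn while watching memory usage on the hosts makes it easier to see at which span, if any, the components start to struggle.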

Change #1114336 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: send sigkill as needed to stateless components

https://gerrit.wikimedia.org/r/1114336

Change #1114336 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: send sigkill as needed to stateless components

https://gerrit.wikimedia.org/r/1114336

fgiunchedi claimed this task.

We are now enforcing Go automemlimit in thanos components, meaning the GC knows the maximum heap it is supposed to have and will give up trying to allocate much more than that.
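
As a rough illustration of the idea, the sketch below derives a GOMEMLIMIT value from a cgroup v2 memory.max reading. It is not how the thanos units are actually configured: the helper name and the 90% ratio are assumptions made for the example. Note also that Go documents its memory limit as a soft limit that the runtime tries to respect, not a hard cap.

```python
def gomemlimit_value(cgroup_memory_max, percent=90):
    """Return a GOMEMLIMIT string for a cgroup v2 memory.max value.

    Returns None when the cgroup is unlimited ("max"). The 90% default is an
    illustrative choice, not a value taken from this task.
    """
    raw = cgroup_memory_max.strip()
    if raw == "max":
        return None
    # Integer arithmetic avoids floating point rounding surprises.
    limit_bytes = int(raw) * percent // 100
    return f"{limit_bytes}B"


if __name__ == "__main__":
    # Example: a 1 GiB cgroup limit.
    print(gomemlimit_value("1073741824\n"))
```

Leaving some headroom below the cgroup limit gives the Go GC a chance to work harder before the kernel-level limit is reached.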