Today we were paged about thanos-query being unavailable, specifically ProbeDown for the thanos-query service.
It turns out to be similar to what we experienced in T356788: thanos-query probedown due to OOM of both eqiad titan frontends, namely a "query of death" consuming most or all resources on the titan hosts.
The significant difference in my mind is that the defenses we've put in place were not sufficient to keep the service from wedging. One defense was to put all thanos-related units in thanos.slice and allow the slice to use up to 85% of memory, with the idea that the OOM killer would kill the offending service, which would then be restarted.
For reasons still unclear to me, the OOM killer doesn't seem to have kicked in at the time of the outage on the titan1* hosts (the ones affected).
See the latter spike, which is the one that caused the outage: https://grafana.wikimedia.org/goto/SzIhuxvHg?orgId=1
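To help check whether the slice limit was reached and whether anything was actually killed, here is a minimal sketch that reads the kernel's event counters for thanos.slice. It assumes the titan hosts use the cgroup v2 unified hierarchy mounted at /sys/fs/cgroup (I haven't verified this); the path and the helper names `parse_memory_events` and `summarize` are illustrative only.

```python
from pathlib import Path

# Assumption: cgroup v2 unified hierarchy, with the slice directly under the root.
EVENTS_PATH = Path("/sys/fs/cgroup/thanos.slice/memory.events")


def parse_memory_events(text):
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def summarize(events):
    """Classify the counters: was the hard limit hit, and was anything killed?"""
    if events.get("oom_kill", 0) > 0:
        return "oom_kill"
    if events.get("max", 0) > 0:
        return "limit_hit_no_kill"
    return "limit_not_hit"


if __name__ == "__main__":
    if EVENTS_PATH.exists():
        events = parse_memory_events(EVENTS_PATH.read_text())
        print(events, summarize(events))
    else:
        print(f"{EVENTS_PATH} not found (different cgroup layout, or no such slice)")
```

The counters reset when the cgroup is recreated (for example after a reboot), so this is only useful on a host that has stayed up since the spike. A nonzero `high` counter with `oom_kill` at 0 would suggest the slice was throttled rather than killed, but that is a guess until we confirm which limit the slice actually sets.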
