
Move thanos cache out of process
Closed, ResolvedPublic

Description

While Thanos' in-memory caching has worked so far, one consequence is that after a process OOM we start with an empty cache.

We have extensive memcached experience and puppetization, so it shouldn't be a whole lot of work to move Thanos caching to memcached:

  • Deploy memcache to titan hosts (size tbd)
  • Switch thanos-query-frontend to use memcache on localhost
  • Also move thanos-store cache to memcache

As followups/improvements:

  • Investigate if thanos memcache client does the right thing with multiple servers (i.e. handles failure and sharding)
  • If it does, then we can add all titan hosts to thanos memcache configuration
  • Consider upgrading titan host memory for memcache
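
For the query-frontend step, the response cache configuration would look roughly like this. This is a sketch based on Thanos' cache config format; the timeout and connection values are placeholders, not what was deployed:

```yaml
# Sketch: passed to thanos query-frontend via
# --query-range.response-cache-config-file (values are illustrative)
type: MEMCACHED
config:
  addresses: ["localhost:11211"]
  timeout: 500ms
  max_idle_connections: 100
```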

Related Objects

Event Timeline

fgiunchedi renamed this task from Move thanos cache out of process to Move thanos-query-frontend cache out of process.May 14 2025, 3:44 PM
fgiunchedi updated the task description.

As part of this I think we should summarize the cache backends available today and document the reason for selecting the memcached backend. Other options available are:

  • For in-process cache warming there is Thanos groupcache, which would distribute the in-process cache across multiple hosts and avoid the need for additional services/clusters.
  • For out-of-process caching Redis and memcached are both supported, and we do have puppetization for both.

One of the big benefits to an out-of-process cache is clustering/scaling, which should effectively 4x the available cache and address the warming issues at the same time. But of course this comes with cluster management overhead/maintenance considerations. In my view that's probably the main criterion -- beyond a localhost implementation, which of these options would we expect to be easiest to scale into a clustered deployment?

> As part of this I think we should summarize the cache backends available today and document the reason for selecting the memcached backend. Other options available are:
>
>   • For in-process cache warming there is Thanos groupcache, which would distribute the in-process cache across multiple hosts and avoid the need for additional services/clusters.

groupcache is marked as experimental AFAICS, and it is in-process, whereas we're looking at out-of-process caching in this task.

>   • For out-of-process caching Redis and memcached are both supported, and we do have puppetization for both.
>
> One of the big benefits to an out-of-process cache is clustering/scaling, which should effectively 4x the available cache and address the warming issues at the same time. But of course this comes with cluster management overhead/maintenance considerations. In my view that's probably the main criterion -- beyond a localhost implementation, which of these options would we expect to be easiest to scale into a clustered deployment?

The easiest is memcache: Thanos' memcached client supports multiple addresses and will shard cache keys across all servers listed. In other words, going from localhost to clustered just means adding servers to the address list. AFAICS that is not possible with the current Thanos Redis client; to achieve clustering with Redis my understanding is that we would need to go the Redis Cluster route, which seems more trouble than it's worth from the get-go. Also note that for the clustered option we'd be clustering site-local servers together (i.e. 2x cache, not 4x) so as not to pay a ~100x latency hit.
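
The multi-address sharding described above can be sketched roughly as follows. This is illustrative only: `pick_server` is a hypothetical helper, and the real Thanos client uses its own hashing scheme and connection handling; the host names are placeholders.

```python
import hashlib

def pick_server(key: str, servers: list[str]) -> str:
    # Hash the cache key and map it deterministically onto the configured
    # server list -- similar in spirit to what a sharding memcached client
    # does when given multiple addresses.
    digest = hashlib.sha1(key.encode()).digest()
    idx = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[idx]

# Going from localhost to clustered is just a longer server list
# (host names below are illustrative placeholders):
local = ["localhost:11211"]
cluster = ["titan-a:11211", "titan-b:11211", "titan-c:11211"]
```

The key point is that the same key always lands on the same server, so adding addresses spreads the cache without any server-side coordination.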

Thanks, makes sense to me. Cross-site latency is a good point; not sure why I had it in mind that we might share across all nodes when realistically it'd be per-site.

When it comes to sizing, assuming the initial testing goes well, I think it'd be fair to consider upgrading memory on the titan hosts specifically for memcached; I'll add an item to the followups.

Change #1154844 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] titan: deploy local memcached

https://gerrit.wikimedia.org/r/1154844

Change #1154845 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] query-frontend: enable memcached on localhost

https://gerrit.wikimedia.org/r/1154845

Change #1154844 merged by Filippo Giunchedi:

[operations/puppet@production] titan: deploy local memcached

https://gerrit.wikimedia.org/r/1154844

Change #1154845 merged by Filippo Giunchedi:

[operations/puppet@production] query-frontend: enable memcached on titan[21]001

https://gerrit.wikimedia.org/r/1154845

Change #1155231 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: enable memcache on all titan hosts

https://gerrit.wikimedia.org/r/1155231

Change #1155231 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: enable memcache on all titan hosts

https://gerrit.wikimedia.org/r/1155231

fgiunchedi renamed this task from Move thanos-query-frontend cache out of process to Move thanos cache out of process.Jun 11 2025, 1:54 PM
fgiunchedi updated the task description.

Change #1156341 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: add memcached-based index caching to store

https://gerrit.wikimedia.org/r/1156341
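
The store-side index cache referenced by this patch would be configured in the same format as the query-frontend cache. A sketch, with placeholder values rather than the deployed settings:

```yaml
# Sketch: passed to thanos store via --index-cache.config-file
# (values are illustrative)
type: MEMCACHED
config:
  addresses: ["localhost:11211"]
  max_item_size: 1MiB
```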

Change #1156342 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: trial store memcache on titan[12]001

https://gerrit.wikimedia.org/r/1156342

Change #1156343 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] thanos: activate store memcached across the board

https://gerrit.wikimedia.org/r/1156343

Change #1156728 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: restrict titan memcached access

https://gerrit.wikimedia.org/r/1156728

Change #1156728 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: restrict titan memcached access

https://gerrit.wikimedia.org/r/1156728

Change #1156341 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: add memcached-based index caching to store

https://gerrit.wikimedia.org/r/1156341

Change #1156342 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: trial store memcache on titan[12]001

https://gerrit.wikimedia.org/r/1156342

Change #1156343 merged by Filippo Giunchedi:

[operations/puppet@production] thanos: activate store memcached across the board

https://gerrit.wikimedia.org/r/1156343

fgiunchedi claimed this task.

This is completed -- the working set size for memcached is ~6GB per host. As a further optimization we could look at sharding the cache amongst hosts, though I doubt it'll be needed.
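
For reference, the per-host working set can be read from memcached's plain-text `stats` output (e.g. `echo stats | nc localhost 11211`). A minimal parser sketch; the sample values below are made up, not measurements from the titan hosts:

```python
def parse_memcached_stats(raw: str) -> dict[str, str]:
    """Parse memcached's plain-text `stats` response into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split()
        # Each data line has the form: STAT <name> <value>
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats

# Made-up sample output; `bytes` is the memory used by stored items.
sample = "STAT bytes 6442450944\nSTAT curr_items 120000\nEND"
used_gib = int(parse_memcached_stats(sample)["bytes"]) / 2**30
print(f"working set: {used_gib:.1f} GiB")  # prints "working set: 6.0 GiB"
```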