
Thanos Cache Tuning
Open, MediumPublic

Assigned To
None
Authored By
herron
Jul 1 2024, 5:15 PM
Referenced Files
F58030342: Screenshot 2024-12-19 at 10.50.28 AM.png
Dec 19 2024, 5:31 PM
F58030242: Screenshot 2024-12-19 at 10.14.20 AM.png
Dec 19 2024, 5:31 PM

Description

Now that we've upgraded memory on the titan (thanos) hosts from 32G to 128G, we have headroom to explore some cache tuning/increases.

I created a dashboard showing a high-level cache overview here: https://grafana-rw.wikimedia.org/d/c2b5ccc9-0c16-45ae-99bc-a244e3f73808/thanos-cache-overview?orgId=1&from=now-7d&to=now

I'm thinking a reasonable initial target would be cache hit rates above 90%, while minimizing cache overflows.
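To back the >90% target with a dashboard panel, a hit-ratio expression along these lines could work (the metric names here assume the Thanos in-memory cache backend and are my guess at what's exported; adjust for a memcached backend):

```
sum by (name) (rate(thanos_cache_inmemory_hits_total[5m]))
/
sum by (name) (rate(thanos_cache_inmemory_requests_total[5m]))
```

A similar rate over an overflow/eviction counter would cover the second half of the target.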

In particular I'm interested to see what improvements, if any, this can make for known slow queries (for instance, the istio recording rules).

Event Timeline

herron triaged this task as Medium priority.

Change #1051177 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos: increase query frontend and store cache sizes

https://gerrit.wikimedia.org/r/1051177
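For reference, the query-frontend response cache is set via `--query-range.response-cache-config`; a minimal in-memory sketch of that config's shape (sizes are illustrative, not the values from the patch):

```yaml
# Thanos query-frontend response cache (illustrative sizes)
type: IN-MEMORY
config:
  max_size: "2GB"       # total memory budget for cached responses
  max_size_items: 0     # 0 = no limit on item count
  validity: 0s          # 0 = cached entries don't expire
```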

Change #1051177 merged by Herron:

[operations/puppet@production] thanos: increase query frontend and store cache sizes

https://gerrit.wikimedia.org/r/1051177

Change #1105037 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos-store: enable caching bucket

https://gerrit.wikimedia.org/r/1105037
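For context, the caching bucket wraps thanos-store's object storage reads with a cache layer. A hedged sketch of the config shape passed to `--store.caching-bucket.config` (TTL and size values below are upstream defaults as I recall them, not necessarily what the patch sets):

```yaml
# Thanos store caching bucket, illustrative values
type: IN-MEMORY
config:
  max_size: "2GB"                  # memory budget for cached bucket data
chunk_subrange_size: 16000         # bytes fetched per chunk subrange
max_chunks_get_range_requests: 3
chunk_object_attrs_ttl: 24h
chunk_subrange_ttl: 24h
blocks_iter_ttl: 5m
metafile_exists_ttl: 2h
metafile_doesnt_exist_ttl: 15m
metafile_content_ttl: 24h
```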

A few more cache tuning patches that should have been linked on this task:

Change #1105037 merged by Herron:

[operations/puppet@production] thanos-store: enable caching bucket

https://gerrit.wikimedia.org/r/1105037

Initial results with the Thanos store caching bucket enabled look promising. I'm seeing reductions in duration, errors, and network/socket utilization, plus a slight decrease in CPU utilization.

{F58028229} {F58028264}{F58028221} {F58028224}

Along with these come some increases in disk read/write (see above) and GC time; we should be able to tune those out if they become a problem.

{F58028233}

I added new panels to track Thanos store caching bucket metrics here https://grafana-rw.wikimedia.org/d/c2b5ccc9-0c16-45ae-99bc-a244e3f73808/thanos-cache-overview

Since we're starting to see some positive effects, I think it's worth tuning this further. There's still room for improvement in iter and get hit rates, and in bucket cache evictions overall. Also, you can see Thanos store quickly consumed all the memory available for cache.

Next things to try that come to mind...

  • increase the store cache memory size, aiming to reduce evictions
  • disable max_item_size (as was done for the query frontend)
  • increase the chunk pool size (afaict we're running the default 2GB; try doubling to 4GB and review)
  • consider a persistent clustered caching backend or in-memory groupcache to increase memory available to cache, speed up warming, and spread load across the titan hosts.
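The first two bullets map to the store's index cache config (`--index-cache.config`); a sketch of what that change could look like, with illustrative values (whether "0" fully disables the per-item cap is an assumption on my part):

```yaml
# Thanos store in-memory index cache, illustrative values
type: IN-MEMORY
config:
  max_size: "24GB"     # raise the budget to cut evictions
  max_item_size: "0"   # assumed to disable the per-item size cap
```

The chunk pool bullet would instead be a flag change on thanos-store, along the lines of `--chunk-pool-size=4GB`.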

Change #1105389 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos-store: manage and increase chunk-pool-size setting

https://gerrit.wikimedia.org/r/1105389

Change #1105395 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] thanos-store: increase store cache size to 24GB

https://gerrit.wikimedia.org/r/1105395


Thank you for looking into this -- AFAICS the caching bucket helped a little on the thanos-store side, reducing CPU usage at the expense of more memory used. Bandwidth dropped too, though not by much; for example, see thanos2001 host stats for the last ~7d (i.e. showing the increase when we dropped the recording rules, and the drop after enabling the caching bucket): https://grafana.wikimedia.org/goto/6YJ6C1SNg


In my mind the problem lies in the amount of work thanos-query has to do to answer pyrra's queries without recording rules (i.e. the CPU needed to process the full 12w of raw data coming back from thanos-store). Indeed, while thanos-store CPU dropped with the caching bucket, thanos-query CPU usage remained unchanged, since the amount of data/metrics to process hasn't changed: https://grafana.wikimedia.org/goto/BfwtC1INR

In other words, while we could tune thanos-store further, that won't change the nature and sustainability of the problem we're facing. I think we should return to the recording rules before the holiday period to get back to a known state, and reevaluate/debug/diagnose the gap problem in Jan, in addition to the steps I've outlined in https://phabricator.wikimedia.org/T302995#10409335

Looking again today, with some more time passed, the good news is we've dropped rx bandwidth from roughly ~250MB/s to ~150MB/s sustained, in the ballpark of a 40% reduction.

Screenshot 2024-12-19 at 10.14.20 AM.png (244 KB)

And since the services run on the same hosts, the CPU reduction from the thanos store caching bucket lowers overall CPU load by ~15-20%.

Screenshot 2024-12-19 at 10.50.28 AM.png (151 KB)

Not to say the above is the ideal end state, but I'm hoping it's enough of an improvement to tide us over the break.

Re: reverting liftwing to the previous recording rules, the trouble is it'd be a revert to a broken state in Pyrra, where the SLOs would record bad values and activate several burn alerts. If we do need to revert, the safest thing would be to offboard the liftwing SLOs, but it'd be nice to avoid that if we can. Would it be fair to prep an offboard patch and keep it in our back pocket in case of issues?

Re: follow-up/next steps in Jan, I agree with you. IMO it makes sense to pursue both.


Yes, the revert patch is certainly an option. In addition to the revert, please make sure to flag this issue/task/patch as a potential incident in the "Temporary incident response steps" Google doc during the holidays.

Alternatively, let's offboard the problematic liftwing SLOs before the holidays and get thanos/titan back to a pre-raw-metric load. AFAICS said SLOs were onboarded on Dec 10th with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1101911, and it seems fine to me to roll back a 10-day-old change.

Change #1105921 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] pyrra: remove liftwing slos

https://gerrit.wikimedia.org/r/1105921

Change #1105921 merged by Herron:

[operations/puppet@production] pyrra: remove liftwing slos

https://gerrit.wikimedia.org/r/1105921

let's please offboard the problematic liftwing SLOs before the holidays and get thanos/titan to a pre-raw-metric load

Ok, done -- Let's sync up in Jan re: T302995#10418909

Change #1105395 abandoned by Herron:

[operations/puppet@production] thanos-store: increase store cache size to 24GB

Reason:

no longer using internal cache

https://gerrit.wikimedia.org/r/1105395

Change #1105389 abandoned by Herron:

[operations/puppet@production] thanos-store: manage and increase chunk-pool-size setting

Reason:

https://gerrit.wikimedia.org/r/1105389