
Re-evaluate need for "cool-off bounce" in WANObjectCache
Closed, Resolved, Public

Description

Background

This feature was added for T203786, which tracked WMF production incidents caused by the 1G network links between MW appservers and memcached servers becoming saturated during peak traffic, due to high 'get' transfer volume and possibly high 'set' transfer volume after cache misses.

This then resulted in mcrouter declaring a host as "TKO" and locally returning false for all memc requests for a fixed period of time, which in turn resulted in a MediaWiki outage.

Meanwhile

  1. The memcached hosts have had their network links upgraded from 1G to 10G at WMF, massively increasing capacity and making the issue hard to re-occur.
  2. WMF SRE introduced an on-host memcached tier on each appserver for storing ParserCache values. These ParserOutput values are the largest values transferred, thus further reducing the issue.
  3. Plus, a memcached gutter pool was added as failover to mcrouter, which means that even if a TKO can still happen, it no longer results in an unwritable cache backend.
  4. Plus, with multi-DC deployed this year, traffic is now generally split between the DCs, thus further reducing congestion between appservers and memc.
  5. General improvements to MediaWiki, PHP upgrades, MariaDB and kernel/CPU tuning have reduced cache miss latencies, thus further reducing the odds of misses causing an outage.

Proposal

The cool-off bounce feature in WANObjectCache adds significant complexity to the getWithSetCallback method, in particular the duplicate cost of object serialisation and the extra network ADD round trip to coordinate the conditional set.
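For illustration, here is a minimal Python sketch of the bounce mechanism described above (this is not the actual PHP implementation; the backend class, key prefix, and cool-off window are hypothetical): after regenerating a value on a miss, the writer first attempts an atomic ADD on a short-lived sister key, and if that ADD fails, a recent writer already won the window, so the SET is skipped to shed write traffic.

```python
import pickle
import time

class MemBackend:
    """Minimal in-memory stand-in for a memcached client (illustrative only)."""
    def __init__(self):
        self.store = {}  # key -> (value, expiry timestamp)

    def _live(self, key):
        entry = self.store.get(key)
        return entry is not None and entry[1] > time.monotonic()

    def add(self, key, value, ttl):
        # memcached-style atomic ADD: succeeds only if the key is absent
        if self._live(key):
            return False
        self.store[key] = (value, time.monotonic() + ttl)
        return True

    def set(self, key, value, ttl):
        self.store[key] = (value, time.monotonic() + ttl)

COOLOFF_TTL = 1.0  # hypothetical spacing between allowed sets for a hot key

def set_with_cooloff(backend, key, value, ttl):
    """Skip ('bounce') the SET if another writer set this key very recently."""
    blob = pickle.dumps(value)  # stands in for PHP serialize()
    # the extra ADD round trip: only one writer per cool-off window wins
    if not backend.add('cooloff:' + key, 1, COOLOFF_TTL):
        return False  # bounced: redundant write shed during a stampede
    backend.set(key, blob, ttl)
    return True
```

Under a miss stampede, many processes finish regeneration at nearly the same moment; the ADD ensures only the first SET per window goes over the wire, at the price of an extra round trip on every write.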

It seems probable that the issue was highly specific to WMF, and also no longer applicable there, hence this experiment to measure where we are, and what the benefits of removing it might be.

I suggest we decide on one or more scenarios for how we want to prove the need for it, and then, with an ad-hoc patch applied that disables the feature, prove or disprove that need. E.g. identify where the bottleneck is in general with a large traffic influx, and then, more specifically, a scenario where a major memc key is missing (one where other WANObjectCache protections like lockTSE and the interim key are unsuccessful) and we get a stampede; then see whether that can lead to an outage before a different bottleneck becomes the cause of the outage.
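To make the referenced protections concrete, here is a hedged Python sketch of the general lockTSE/interim pattern (simplified; the class and key names are made up and this is not WANObjectCache's API): on a miss, only the process that wins a mutex ADD regenerates the value, while losers serve a stale copy instead of stampeding the backend.

```python
class TinyCache:
    """Minimal in-memory cache with atomic-add semantics (illustrative only)."""
    def __init__(self):
        self.d = {}
    def get(self, key):
        return self.d.get(key)
    def set(self, key, value):
        self.d[key] = value
    def add(self, key, value):
        # succeeds only if the key is absent, like memcached ADD
        if key in self.d:
            return False
        self.d[key] = value
        return True

def get_with_set(cache, key, regen):
    """One regenerator per miss; others serve the stale copy (lockTSE-like)."""
    val = cache.get(key)
    if val is not None:
        return val
    if cache.add('mutex:' + key, 1):
        # we won the regeneration lock: compute, store, and keep a stale copy
        val = regen()
        cache.set(key, val)
        cache.set('stale:' + key, val)
        return val
    # lost the regeneration race: return the stale value (may be None)
    return cache.get('stale:' + key)
```

The stampede scenario in the proposal is precisely the case where this fallback fails: the key is missing, the lock holder is slow or gone, and there is no stale copy to serve, so every request falls through to the backend at once.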

Event Timeline

Change 830706 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] [DNM] objectcache: Disable cool-off bounce feature

https://gerrit.wikimedia.org/r/830706

Krinkle triaged this task as Medium priority. Oct 27 2022, 5:06 PM
Krinkle added a subscriber: jijiki.

From talking with Aaron and Tim, we'd like to quantify how much bandwidth ParserOutput's on-host-memcached tier is absorbing during high load scenarios, compared to the main memcached traffic from the same server. The result of that would then help us decide whether we think cool-off is still needed.

Another scenario (see task description) is temporarily disabling cool-off and quantifying how much it helps today.

From today's Perf:ServiceOps sync, @Joe recommended we work with @jijiki on both of these.

Change 853455 had a related patch set uploaded (by Aaron Schulz; author: Aaron Schulz):

[mediawiki/core@master] objectcache: avoid serialize() in WANObjectCache::checkAndSetCooloff()

https://gerrit.wikimedia.org/r/853455

Change 853455 merged by jenkins-bot:

[mediawiki/core@master] objectcache: avoid serialize() in WANObjectCache::checkAndSetCooloff()

https://gerrit.wikimedia.org/r/853455
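Judging from the commit title and the duplicate-serialisation cost called out in the task description, the optimisation is presumably of the following shape (a speculative Python stand-in, not the actual PHP change; the function names and size threshold are invented): serialize the value once and reuse the blob for both the size bookkeeping and the store, rather than serializing twice.

```python
import pickle

SIZE_THRESHOLD = 32 * 1024  # hypothetical: cool-off logic cares about large values

def dumps_counted(value, counter):
    """pickle.dumps wrapper that counts calls, standing in for PHP serialize()."""
    counter[0] += 1
    return pickle.dumps(value)

def store_twice(store, key, value, counter):
    """Old shape: the size check and the store each serialize the value."""
    is_big = len(dumps_counted(value, counter)) >= SIZE_THRESHOLD
    store[key] = dumps_counted(value, counter)
    return is_big

def store_once(store, key, value, counter):
    """New shape: serialize once, reuse the blob for the check and the store."""
    blob = dumps_counted(value, counter)
    is_big = len(blob) >= SIZE_THRESHOLD
    store[key] = blob
    return is_big
```

For the large values that cool-off targets, dropping the second serialize halves the CPU cost of that path on every cache set.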

Impact:

https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?from=now-9d&orgId=1&to=now&var-kClass=SqlBlobStore_blob

Screenshot 2022-11-26 at 07.38.55.png (913×2 px, 221 KB)

The orange-highlighted graph is the one that matters for this change (the "set" overhead). The red line around 24 November marks when the above commit was deployed to all wikis on Thursday. The drop around 17 November marks the deployment of the WRITE_BACKGROUND optimisation a week earlier (details at T302623#8404657).

This time, it did not make a notable impact one way or the other. Given that the optimisation made the code simpler, however, I'd say it's worth keeping regardless, as it hasn't made anything slower either.

Change 902376 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@wmf/1.41.0-wmf.1] objectcache: Disable cool-off bounce feature

https://gerrit.wikimedia.org/r/902376

Patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/908027 was deployed to wmf/1.41.0-wmf.4 on Tue 11 Apr / early morning Wed 12 Apr, ahead of the group1 and group2 deployments later that week. In the SAL, https://sal.toolforge.org/production?q=1.41.0-wmf.4 we see wmf.4 reached all wikis on Thursday 14 April, between 18:34 and 21:10 UTC (T330210). The train took longer than usual for unrelated reasons, which leaves us a rather wide time window for the before-and-after comparison; not ideal.

My hypothesis was: cool-off bounces have no benefit under normal circumstances, and have a positive impact under heavy load (less cache write traffic), but that difference is no longer a bottleneck; and under normal circumstances its overhead may actually add a small but measurable latency cost that its removal would save.

To my knowledge, we weren't under attack in the last 5 days in the specific way that this feature is meant to mitigate. Hence we're only testing the first and third part of my hypothesis. The middle one is SRE responsibility (SvcOps: Joe, Effie), and they've cleared removal of this feature with the understanding that it is indeed likely no longer needed, and that we can always bring it back after a small number of incidents that may or may not bring us down anyway.

Experiment review

I took the top cache keys by miss rate from Grafana: WANObjectCache, and reviewed their metrics (the "Top" panel links to Grafana: WANCache by Key group). These were sqlblobstore_blob, page, filerepo_file_foreign_description, and revision_row_1.29. I expanded the time window to 28 days so that we're not just comparing before-weekdays to after-weekend, but also week over week for the same day.

It looks like the set overhead may have gone from ~0.05ms to ~0.04ms for SqlBlobStore_blob and page_content_model (both have a high miss frequency, at 2M/min and 200K/min respectively, and both have a cache-miss ratio of ~50%). It's too fuzzy to tell, though. It'd be great to have this on Prometheus so that we get less per-minute variance and can compare in a more equal-weighted fashion over several hours at once (ding ding T240685).

On the infra side, Grafana: Memcached (dc=codfw) and Grafana: Mcrouter (dc=codfw, cluster=appserver) did not show a noticeable change in traffic: neither an increase in bandwidth, nor a notable decrease in ADD/lock or SET operations. The regular locks and sets from WANCache in general (not its cool-off feature) seem to far outweigh these. That confirms it was working correctly, or at least that it wasn't over-protective, until now.

Change 830706 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Remove WANObjectCache's internal cool-off bounce feature

https://gerrit.wikimedia.org/r/830706