Background
This feature was added for T203786, which tracked WMF production incidents caused by the 1G network links between mw servers and memc servers being saturated during peak traffic, due to a high volume of 'get' transfers and possibly a high volume of 'set' transfers after cache misses.
This then caused mcrouter to declare a host as "TKO" and to locally return false for all memc requests for a fixed period of time, which in turn resulted in a MW outage.
Meanwhile
- The memcached hosts have had their network links upgraded from 1G to 10G at WMF, massively increasing capacity and making the issue unlikely to recur.
- WMF SRE introduced an on-host memcached tier on each appserver for storing ParserCache values. ParserCache values are the largest values transferred, so this further reduces the issue.
- A memcached gutterpool was added to mcrouter as failover, which means that even if saturation can still happen, it no longer results in an unwritable cache backend.
- With multi-DC deployed this year, traffic is now generally split between the DCs, further reducing congestion between appservers and memc.
- General improvements to MediaWiki, PHP upgrades, and MariaDB and kernel/CPU tuning have reduced cache-miss latencies, further reducing the odds of misses causing an outage.
Proposal
The cool-off bounce feature in WANObjectCache adds significant complexity to the getWithSet method, in particular the duplicate cost of object serialisation and the extra network ADD roundtrip needed to coordinate the conditional set.
It seems probable that the issue was highly specific to WMF and is no longer applicable even there, hence this experiment to measure where we are and what the benefits of removing the feature might be.
I suggest we decide on one or more scenarios for how we want to prove the need for it, and then, with an ad-hoc patch applied that disables the feature, prove or disprove that need. E.g. first identify where the bottleneck is in general under a large traffic influx. Then, more specifically, test a scenario where a major memc key is missing (one where other WANObjectCache protections like lockTSE and interim values are unsuccessful) and we get a stampede, to see whether that can lead to an outage before a different bottleneck becomes the cause of the outage.
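
To make the overhead mentioned above concrete, here is a minimal Python sketch of a cool-off style conditional set. It is an illustration only, not the WANObjectCache code: FakeMemcached, set_with_cooloff, SIZE_THRESHOLD, COOLOFF_TTL and the "cooloff:" key prefix are all hypothetical names and values. It shows the value being serialised twice (once to estimate its size, once for the actual write) and the extra ADD roundtrip that decides which caller may write.

```python
import pickle


class FakeMemcached:
    """In-memory stand-in with memcached-like ADD/SET semantics (hypothetical)."""

    def __init__(self):
        self.store = {}      # key -> (value, expiry timestamp)
        self.roundtrips = 0  # counts simulated network roundtrips

    def add(self, key, value, ttl, now):
        # ADD only succeeds when the key is absent or already expired.
        self.roundtrips += 1
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return False
        self.store[key] = (value, now + ttl)
        return True

    def set(self, key, value, ttl, now):
        self.roundtrips += 1
        self.store[key] = (value, now + ttl)
        return True


SIZE_THRESHOLD = 1024  # bytes; hypothetical cut-off for "large" values
COOLOFF_TTL = 1        # seconds; hypothetical cool-off window


def set_with_cooloff(cache, key, value, ttl, now):
    """Write value, but let only one caller per window write a large value."""
    # First serialisation: done only to estimate the size of the value.
    size = len(pickle.dumps(value))
    if size >= SIZE_THRESHOLD:
        # Extra roundtrip: whoever wins the ADD may write; others bounce.
        if not cache.add("cooloff:" + key, 1, COOLOFF_TTL, now):
            return False
    # Second serialisation: the storage layer serialises the value again.
    return cache.set(key, pickle.dumps(value), ttl, now)
```

In this sketch, a large value costs two serialisations and two roundtrips per successful write, whereas a small value costs one roundtrip. An ad-hoc patch that disables the feature would correspond to skipping the size check and the ADD entirely.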
