
Let WANObjectCache store "sister keys" on the same backend as the main value key
Closed, Resolved (Public)

Description

@aaron wrote on Gerrit:

This is useful for grouping related keys on the same servers to reduce
the need for cache server connections and availability. A cache key that
uses "lockTSE" can already involve accessing several keys during the
read/write cache-aside paths:
a) The value key itself
b) The check key (named after the main key, a common pattern)
c) The mutex key (used if the value looks stale)
d) The cool-off key (used if regeneration took a while)

Any problems accessing the first two could cause extra value regenerations.
Problems with the mutex key could lead to stampedes due to threads assuming
another thread was regenerating a soon-to-expire value when, in fact, none was.
A similar problem could happen with cool-off keys, with threads assuming
that another saved the newly regenerated value when, in fact, none did.

The use of hash stops puts the tiny related keys on the same server as the
main cache key that they serve. This is only for hash-based routing, and not
route prefix routing (e.g. All*Route still sends the key to multiple child
routes, but the PoolRoute/HashRoute function will hash differently).
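
The grouping described above can be sketched roughly as follows. This is a minimal illustration, not MediaWiki's or Mcrouter's actual code: it assumes the Mcrouter hash-stop delimiter `|#|`, a made-up sister key layout (`WANCache:<base>|#|<type>` with hypothetical type letters), a hypothetical server list, and a plain MD5-modulo hash standing in for Mcrouter's real hash function.

```python
import hashlib
from typing import Dict, List

HASH_STOP = "|#|"  # delimiter assumed here; only the text before it is hashed


def routing_part(key: str) -> str:
    """Return the portion of the key that takes part in hashing."""
    return key.split(HASH_STOP, 1)[0]


def pick_server(key: str, servers: List[str]) -> str:
    """Pick a server from the routing part only (stand-in hash, not Mcrouter's)."""
    digest = hashlib.md5(routing_part(key).encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def sister_keys(base: str) -> Dict[str, str]:
    """Build illustrative value/check/mutex/cool-off keys sharing one hash prefix."""
    kinds = {"value": "v", "check": "t", "mutex": "m", "cooloff": "c"}
    return {name: f"WANCache:{base}{HASH_STOP}{code}" for name, code in kinds.items()}


if __name__ == "__main__":
    servers = ["mc1", "mc2", "mc3", "mc4"]  # hypothetical backends
    for name, key in sister_keys("examplewiki:page:123").items():
        print(name, key, "->", pick_server(key, servers))
```

Because every sister key shares the same text before the hash stop, all four keys land on whichever server the main value key hashes to.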

Motivation

MediaWiki communicates with the Memcached backends via a proxy (Mcrouter) local to the individual app server (not coordinated). Spurious errors from Memcached responses are sometimes interpreted by Mcrouter as being indicative of the backend server (or its network connection) being unhealthy. Sometimes that assumption is correct. Sometimes it's not.

In any event, when this happens the individual backend is effectively depooled (in "TKO" mode) for a short time for requests from that particular MW server.

A single logical Memcached key from MW translates to multiple real keys (the secondary "sister" keys store some metadata, interim values, etc.), and those real keys can hash to different backends. This means that when a single server is down, a much larger proportion of logical gets is effectively nulled out than that server's own share of the keys.

To fix that, we'll change it so that the value key and sister keys route to the same Memcached shard.
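
As a rough illustration of the effect, the sketch below estimates what share of logical keys touch one depooled server. It assumes real keys hash uniformly and independently across servers, which is a simplification; the function name and the numbers in the usage lines are hypothetical and not taken from production.

```python
def affected_fraction(servers: int, keys_per_logical: int, coalesced: bool) -> float:
    """Estimated share of logical keys with at least one real key on a single down server."""
    if coalesced:
        # All real keys of a logical key share one server.
        return 1 / servers
    # Each real key independently avoids the down server with probability (1 - 1/servers).
    return 1 - (1 - 1 / servers) ** keys_per_logical


if __name__ == "__main__":
    # Hypothetical pool of 10 servers, 4 real keys per logical key.
    print("scattered:", affected_fraction(10, 4, coalesced=False))
    print("coalesced:", affected_fraction(10, 4, coalesced=True))
```

Under these assumptions, scattering the sister keys makes a single depooled server affect several times more logical keys than coalescing them does.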

Event Timeline

From Gerrit by @aaron:
[mediawiki/core] objectcache: add "coalesceKeys" option to WANObjectCache for key grouping

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/531805/

From Gerrit by @aaron:
[mediawiki/core] objectcache: fix "coalesceKeys" option name in WANObjectCache

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/571791/

From Gerrit by @aaron:
[operations/mediawiki-config] Beta: Enable "coalesceKeys" for WANObjectCache in deployment-prep

https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/571793/

Change 575098 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[operations/mediawiki-config@master] Set "coalesceKeys" in mc.php to minimize host fan-out by WANCache

https://gerrit.wikimedia.org/r/575098

From Gerrit by @aaron:
[mediawiki/core] objectcache: add "non-global" mode to WANObjectCache "coalesceKeys"

https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/589769/

This change will roll out as part of 1.35.0-wmf.32.

Krinkle triaged this task as Medium priority.
Krinkle added a parent task: Restricted Task.
Krinkle moved this task from Inbox to Next: Goal / Jan-Mar '21 on the Performance-Team board.

Change 575098 merged by jenkins-bot:
[operations/mediawiki-config@master] Set "coalesceKeys" to "non-global" for testwiki and mediawikiwiki

https://gerrit.wikimedia.org/r/575098

Change 597895 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] Set "coalesceKeys" to "non-global" for commonswiki

https://gerrit.wikimedia.org/r/597895

Change 597895 merged by jenkins-bot:
[operations/mediawiki-config@master] Set "coalesceKeys" to "non-global" for commonswiki

https://gerrit.wikimedia.org/r/597895

Change 598851 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[operations/mediawiki-config@master] Enable "coalesceKeys"="non-global" for WANCache on commonswiki (II)

https://gerrit.wikimedia.org/r/598851

Change 598851 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable "coalesceKeys"="non-global" for WANCache on commonswiki (II)

https://gerrit.wikimedia.org/r/598851

Change 602935 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] User: Fix pingLimiter() to use makeGlobalKey() for global rate limits

https://gerrit.wikimedia.org/r/602935

Change 598855 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[operations/mediawiki-config@master] Enable "coalesceKeys" for global keys for WANCache

https://gerrit.wikimedia.org/r/598855

Change 598855 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable "coalesceKeys" for global keys for WANCache

https://gerrit.wikimedia.org/r/598855

Revert "Enable "coalesceKeys" for global keys for WANCache"

https://gerrit.wikimedia.org/r/604211

During testing on mwdebug1002, it was found to consistently cause 1-2 minutes of forced read-only mode on all wikis due to a bad db lag fallback state. Cause unknown. To be determined.

Change 607155 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/mediawiki-config@master] [DNM] Enable "coalesceKeys" for global keys for WANCache (II)

https://gerrit.wikimedia.org/r/607155

Krinkle raised the priority of this task from Medium to High. (Jul 29 2020, 2:09 AM)

Change 607155 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable "coalesceKeys" for global keys for WANCache (II)

https://gerrit.wikimedia.org/r/607155

Change 655809 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix bogus read-only mode bug in LoadBalancer

https://gerrit.wikimedia.org/r/655809

Change 655809 merged by jenkins-bot:
[mediawiki/core@master] rdbms: fix bogus read-only mode bug in LoadBalancer

https://gerrit.wikimedia.org/r/655809

Change 658372 had a related patch set uploaded (by Krinkle; owner: Aaron Schulz):
[operations/mediawiki-config@master] Enable "coalesceKeys" for global keys for WANCache (III)

https://gerrit.wikimedia.org/r/658372

Change 658372 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable "coalesceKeys" for global keys for WANCache (III)

https://gerrit.wikimedia.org/r/658372

Change 661832 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@wmf/1.36.0-wmf.27] rdbms: fix bogus read-only mode bug in LoadBalancer

https://gerrit.wikimedia.org/r/661832

Change 661832 merged by jenkins-bot:
[mediawiki/core@wmf/1.36.0-wmf.27] rdbms: fix bogus read-only mode bug in LoadBalancer

https://gerrit.wikimedia.org/r/661832