WANObjectCache::getWithSetCallback seems not to set objects when fetching data is slow
Closed, Resolved · Public

Description

While investigating the strange behaviour of Babel::getCachedUserLanguageInfo() as a follow-up to an incident, in which slow responses from the backend resulted in a much higher rate of cache misses, I dug deeper into what was wrong with the code path in question.

From reading the documentation extensively, and from a couple of conversations with @Krinkle, I was convinced that WANObjectCache::getWithSetCallback as used in the Babel extension (see https://gerrit.wikimedia.org/g/mediawiki/extensions/Babel/+/4606c64ee5f42ee841ffb4882fbc52df808da30a/includes/Babel.php#323) should cache the result as an empty array in case no result is returned within 10 seconds (the timeout). Data from past outages suggested the contrary.

So, to test this hypothesis in our production setting, I ran a small HTTP server written in Go that listens on port 9099 and responds "ok" after 15 seconds. If curious, you can see the code at P10386.

I wrote the following code, to be run with shell.php:

use MediaWiki\MediaWikiServices;
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$key = "joe::test";
// Delete any leftover key, and do not produce a tombstone.
$cache->delete( $key, 0 );
// Now let's simulate a situation where we have a remote responding only after 15 seconds.
function get_set( $cache, $key ) {
    $curtime = time();
    $val = $cache->getWithSetCallback(
        $key,
        100,
        function ( $oldValue, &$ttl, array &$setOpts ) use ( $curtime ) {
            $req = MWHttpRequest::factory( 'http://localhost:9099', [ 'timeout' => 10 ], __METHOD__ );
            $status = Status::wrap( $req->execute() );
            if ( !$status->isOK() ) {
                echo "Timeout! $curtime\n";
                return $curtime;
            } else {
                return 0;
            }
        },
        [ 'checkKeys' => [ $key ] ]
    );
    if ( $val === 0 ) {
        echo "$curtime - No timeout\n";
    } elseif ( $val == $curtime ) {
        echo "$curtime - cache miss\n";
    } else {
        echo "$curtime - cache hit\n";
    }
}
while ( true ) {
    get_set( $cache, $key );
    sleep( 1 );
}

The result I expected was a first cache miss, then cache hits for the following 100 seconds.

To my surprise, the result was very different:

mwmaint1002:~$ mwscript shell.php --wiki enwiki 
Psy Shell v0.9.12 (PHP 7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1 — cli) by Justin Hileman

[CUT]

Timeout! 1581435967
1581435967 - cache miss
Timeout! 1581435978
1581435978 - cache miss
Timeout! 1581435989
1581435989 - cache miss
Timeout! 1581436000
1581436000 - cache miss
Timeout! 1581436011
1581436011 - cache miss
Timeout! 1581436022
1581436022 - cache miss
Timeout! 1581436033
1581436033 - cache miss
Timeout! 1581436044
1581436044 - cache miss
Timeout! 1581436055
1581436055 - cache miss
Timeout! 1581436066
1581436066 - cache miss
Timeout! 1581436077
1581436077 - cache miss
...

To add more mystery, if I reduce the timeout to 2 seconds, the results are quite different:

mwmaint1002:~$ mwscript shell.php --wiki enwiki 
Psy Shell v0.9.12 (PHP 7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1 — cli) by Justin Hileman
 
[CUT]
Timeout! 1581436212
1581436212 - cache miss
1581436215 - cache hit
1581436216 - cache hit
1581436217 - cache hit
1581436218 - cache hit
1581436219 - cache hit
1581436220 - cache hit
1581436221 - cache hit
1581436222 - cache hit
1581436223 - cache hit
1581436224 - cache hit
1581436225 - cache hit
1581436226 - cache hit
1581436227 - cache hit
1581436228 - cache hit
1581436229 - cache hit
1581436230 - cache hit
1581436231 - cache hit
1581436232 - cache hit
1581436233 - cache hit
1581436234 - cache hit
1581436235 - cache hit
...

So it seems that the slower the fetching function is, the lower the probability that its result gets cached. This is the opposite of what I would intuitively consider the right behaviour.

Please forgive me if I made some basic mistake; my PHP is rusty and my knowledge of MediaWiki is relatively scarce.

Event Timeline

Another test, using a simple sleep:

use MediaWiki\MediaWikiServices;
$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
$key = "joe::test";
$cache->delete( $key, 0 );
// Now let's simulate a slow fetch with a simple 10-second sleep.
function get_set( $cache, $key ) {
    $curtime = time();
    $val = $cache->getWithSetCallback(
        $key,
        100,
        function ( $oldValue, &$ttl, array &$setOpts ) use ( $curtime ) {
            sleep( 10 );
            return $curtime;
        },
        [ 'checkKeys' => [ $key ] ]
    );
    if ( $val === 0 ) {
        echo "$curtime - No timeout\n";
    } elseif ( $val == $curtime ) {
        echo "$curtime - cache miss\n";
    } else {
        echo "$curtime - cache hit\n";
    }
}
while ( true ) {
    get_set( $cache, $key );
    sleep( 1 );
}

yields the same results, as expected: consistent cache misses when I set a 10-second sleep, and cache hits if I set the sleep to 2 seconds.
It is important to note that the same behaviour happens if I remove the checkKeys option.

Joe updated the task description.

At @Krinkle's suggestion, I ran two parallel processes and ran get_set first in one, then in the other, and I got a cache miss both times.

I can reproduce this locally. I tried it with a Node.js app (just because):

var http = require('http');

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var server = http.createServer(async function (request, response) {
  await sleep(15000);
  response.writeHead(200, {"Content-Type": "text/plain"});
  response.end("Hello World\n");
});
server.listen(9099);

console.log("Server running at http://127.0.0.1:9099/");

I think this is a bug in the application logic, since even on localhost there's no network latency to complicate things.

Trying it in debug mode:

[objectcache] Rejected set() for joe::test due to snapshot lag.
[objectcache] fetchOrRegenerate(joe::test): miss, new value computed
[objectcache] Rejected set() for joe::test due to snapshot lag.
[objectcache] fetchOrRegenerate(joe::test): miss, new value computed
[objectcache] Rejected set() for joe::test due to snapshot lag.
[objectcache] fetchOrRegenerate(joe::test): miss, new value computed
[objectcache] Rejected set() for joe::test due to snapshot lag.
[objectcache] fetchOrRegenerate(joe::test): miss, new value computed
[objectcache] Rejected set() for joe::test due to snapshot lag.
[objectcache] fetchOrRegenerate(joe::test): miss, new value computed

Apparently this is a built-in and not properly documented feature (I didn't know it existed) to avoid storing stale data (whatever that means). The default threshold for considering data stale is 7 seconds (a value I guess was chosen arbitrarily), and anything taking longer than that, if it doesn't set lockTSE in its caching options, gets regenerated all the time. It gets even better: the logic is buggy, and even if you do set lockTSE under these conditions, the WANCache decides to lower the TTL to one second (another arbitrary number), which in our case means the cache is completely useless and regenerates everything instead. I think this is a major factor in our Kobe Bryant outage and answers a lot of questions.

This logic was added in I83505a59cce0119e456191c3100f7d97bc86bdbf in 2015.

[edit conflict ;)]

The debug messages point to this code path in WANObjectCache.php:

		// Check if there's a risk of writing stale data after the purge tombstone expired
		if ( $lag === false || ( $lag + $age ) > self::MAX_READ_LAG ) {
			// Case A: any long-running transaction
			if ( $age > self::MAX_READ_LAG ) {
				if ( $lockTSE >= 0 ) {
				    /* .... */
				} else {
					$this->logger->info(
						'Rejected set() for {cachekey} due to snapshot lag.',
						[ 'cachekey' => $key, 'lag' => $lag, 'age' => $age ]
					);
					return true; // no-op the write for being unsafe
				}

lines 630 and on

And in this file we can also see what accounts for the difference between the 2-second and 10-second cases:

	/** @var int Max expected seconds of combined lag from replication and view snapshots */
	const MAX_READ_LAG = 7;

line 155

$age is derived from $opts['since'], which in this code path comes from fetchOrRegenerate():

				$finalSetOpts = [
					// @phan-suppress-next-line PhanUselessBinaryAddRight
					'since' => $setOpts['since'] ?? $preCallbackTime,

line 1493

$preCallbackTime is exactly what it sounds like: the time before the user-provided callback ran.
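
To make the timing interaction concrete, here is a simplified paraphrase of how these pieces combine; it mirrors the excerpts above but is illustrative, not the exact implementation:

// Simplified paraphrase of the logic quoted above (illustrative only).
$preCallbackTime = microtime( true ); // taken just before the callback runs
$value = $callback( $oldValue, $ttl, $setOpts );
// Default "as of" timestamp: the moment before the callback started.
$since = $setOpts['since'] ?? $preCallbackTime;
$age = microtime( true ) - $since; // includes the callback's own runtime
$lag = $setOpts['lag'] ?? 0;
// MAX_READ_LAG is 7 seconds, so a callback that takes ~10s always trips
// this check, while one taking ~2s passes it, matching the tests above.
if ( $lag === false || ( $lag + $age ) > 7 /* self::MAX_READ_LAG */ ) {
    // The write is no-op'ed: "Rejected set() for {cachekey} due to snapshot lag."
}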

The "Rejected set() for {cachekey} due to snapshot lag." case is meant to cover situations like this:

$wanCache->getWithSetCallback(
  $key,
  $ttl,
  function ( $oldValue, &$ttl, array &$setOpts ) {
    $dbr = wfGetDB( DB_REPLICA );
    // Account for any snapshot/replica DB lag
    $setOpts += Database::getCacheSetOptions( $dbr );

    return $dbr->selectRow( ... );
  }
);
Snapshot-lag rejection should fall back to a short-lived interim key

In the above case, if the replica server was selected despite higher-than-normally-tolerated lag (to the point where we'd normally induce read-only mode automatically, and perhaps had already tried several different replicas), or if the replica was fine but there is snapshot lag because the request is long-running (like a maintenance script or a slow job), then when some random code path reaches getWithSetCallback for some random key, we don't want to re-cache it with the normal TTL.

This is expected to affect rare requests where we got a bad replica before the central consensus on which replica to pick was updated, or long-running jobs where we don't want to pre-cache stale values without a clear way to recover from them.

However, even in that scenario, I think it would be fair to still store the value temporarily with a reduced TTL, the same way we already store "INTERIM" values in many other scenarios. I'm not sure why that isn't triggered here.

Why is snapshot lag prevention affecting keys that didn't opt in to DB options?

The Babel use case is not calling Database::getCacheSetOptions, so it's unclear to me why it is still going through a code path that essentially exists for the purpose of avoiding storing stale values from a database. Maybe this shouldn't be enabled by default.

I can imagine a case where maybe the interaction with the database is indirect. This actually does apply to the Babel case. It is hitting the MW API which in turn could be returning lagged values that Babel-on-the-other-side decided not to cache normally. In that case, we shouldn't whitewash that value in Babel-on-this-side. So if this is really important for some keys, maybe they can opt-in in some way to "use interim storage if regeneration takes longer than rep-lag tolerance". Or more long-term, perhaps we can create something like Database::getCacheSetOptions but for MWHTTPRequest. Note that MediaWiki already outputs the X-Database-Lagged: true header in these cases, so it'd be a matter of reading that in and informing $setOpts in the closure.
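
For illustration, a minimal sketch of what reading that header into $setOpts might look like inside the callback. The endpoint URL here is a placeholder, and using 'lag' => false as the "indeterminate lag" signal is my assumption, based on the $lag === false branch quoted earlier:

// Minimal sketch: propagate a backend lag hint into $setOpts, assuming the
// remote MediaWiki sets the X-Database-Lagged header as described above.
function ( $oldValue, &$ttl, array &$setOpts ) {
    $req = MWHttpRequest::factory(
        'https://example.org/w/api.php', // placeholder endpoint
        [ 'timeout' => 10 ],
        __METHOD__
    );
    $status = Status::wrap( $req->execute() );
    if ( !$status->isOK() ) {
        $ttl = WANObjectCache::TTL_UNCACHEABLE; // don't cache errors
        return null;
    }
    if ( $req->getResponseHeader( 'X-Database-Lagged' ) !== null ) {
        // Assumption: 'lag' => false means "indeterminate lag", which the
        // $lag === false branch in the excerpt above treats as unsafe.
        $setOpts['lag'] = false;
    }
    return $req->getContent();
}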

Just to be very explicit: Babel in production goes through the caching layer, so it accepts edge-level one-day caching. But more importantly, it's setting a 10-second timeout itself.

In general, I'd assume it makes more sense to ask the callback to decide what the "stale data acceptable" timeout is, maybe using MAX_READ_LAG as a default.
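
To be concrete, a purely hypothetical sketch of that suggestion; 'staleDataTimeout' is an option name invented here for illustration and does not exist in WANObjectCache:

// Purely hypothetical: 'staleDataTimeout' is an invented option name,
// illustrating the idea of letting callers declare their own tolerance.
$val = $cache->getWithSetCallback(
    $key,
    100,
    $callback,
    [
        // How old (in seconds) regenerated data may be before set()
        // refuses to cache it; would default to MAX_READ_LAG (7).
        'staleDataTimeout' => 30,
    ]
);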

jbond triaged this task as Medium priority. Feb 12 2020, 2:16 PM

Apparently this is a built-in and not properly documented feature (I didn't know it existed) to avoid storing stale data (whatever that means). The default threshold for considering data stale is 7 seconds (a value I guess was chosen arbitrarily).

The 'since'/'lag' options are set by Database::getCacheSetOptions(), mentioned at https://www.mediawiki.org/wiki/Database_optimization#Database_selection_tips . This should probably be added/linked to https://www.mediawiki.org/wiki/Manual:Caching .

The values are based on the kind of lag spikes we saw on our DBs even after various optimizations (they've gone down a bit over time). Basically, the constants allude to seemingly attainable SLAs, with wiggle room for clock skew and so on.

And anything taking longer than that, if it doesn't set lockTSE in its caching options, gets regenerated all the time. It gets even better: the logic is buggy, and even if you do set lockTSE under these conditions, the WANCache decides to lower the TTL to one second (another arbitrary number), which in our case means the cache is completely useless and regenerates everything instead. I think this is a major factor in our Kobe Bryant outage and answers a lot of questions.

This logic was added in I83505a59cce0119e456191c3100f7d97bc86bdbf in 2015.

It sets the "logical TTL" to 1 second, but does not set the memcached TTL that low. That means that, given lockTSE, you'd have one thread regenerating the value every second (or less often if regeneration is slow) while other threads use the stale value. It's a consistency/performance tradeoff. In a CLI test script using a test key, you'd see a bunch of misses, since the single thread always gets/releases the lock. For a frequently used key accessed by web requests, most calls would just use the stale value rather than all regenerating it.
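
For illustration, opting in would look something like this (a minimal sketch; the key, TTL, and lockTSE values are placeholders, and computeValue() is a hypothetical helper):

// Minimal sketch of opting in to lockTSE; all values here are placeholders.
$val = $cache->getWithSetCallback(
    $key,
    $cache::TTL_MINUTE * 5,
    function ( $oldValue, &$ttl, array &$setOpts ) {
        // ... slow regeneration work goes here ...
        return computeValue(); // hypothetical helper
    },
    [
        // With lockTSE, one thread takes a mutex and regenerates while
        // other threads keep serving the stale value instead of piling on.
        'lockTSE' => 30,
    ]
);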

Anyway, since Babel::getCachedUserLanguageInfo() doesn't have lockTSE, it's going down the "reject set()" path. That path has a weakness with values that systemically take a huge time to generate (as opposed to a fluke of it being slow or it being fast but in the middle/end of a long web request). Maybe it could take the add()-only path ("save only if no value exists") and have a lowered TTL. This would mean that stale values could "flicker" back to life even after purges, as long as the stale re-cache happens to be after the key tombstone expires. How stale the value would be would depend on how long a request could run before being killed as a timeout. It would still avoid stuck stale values due to a low TTL but would increase the risk of seeing stale values (and their extent of staleness). This is probably why the more relaxed case requires lockTSE, since that means that the caller already signalled to set()/getWithSetCallback() that staleness is less of a priority than normal.
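
A hypothetical paraphrase of that add()-only idea (this is not the actual patch; $this->cache, $valueKey, and $wrappedValue are placeholder names for WANObjectCache internals):

// Hypothetical paraphrase of the proposed fallback: instead of no-op'ing the
// write entirely, save the value only when no value exists, with a low TTL.
if ( $age > self::MAX_READ_LAG ) {
    // add() only succeeds when the key is absent, so a fresher value is
    // never overwritten by the result of this slow regeneration.
    $this->cache->add( $valueKey, $wrappedValue, self::MAX_READ_LAG );
}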

The Babel use case is not calling Database::getCacheSetOptions, so it's unclear to me why it is still going through a code path that essentially exists for the purpose of avoiding storing stale values from a database. Maybe this shouldn't be enabled by default.

The default 'since' value is based on the timestamp just before callback invocation. If the data is based on an even older snapshot, then it should use Database::getCacheSetOptions() or comparable check. The default could have been the request start time, but that seemed too pessimistic. Having no default seemed too optimistic for general purpose use, however. It's just a best-guess "as of" timestamp.

I can imagine a case where maybe the interaction with the database is indirect. This actually does apply to the Babel case. It is hitting the MW API which in turn could be returning lagged values that Babel-on-the-other-side decided not to cache normally. In that case, we shouldn't whitewash that value in Babel-on-this-side. So if this is really important for some keys, maybe they can opt-in in some way to "use interim storage if regeneration takes longer than rep-lag tolerance". Or more long-term, perhaps we can create something like Database::getCacheSetOptions but for MWHTTPRequest. Note that MediaWiki already outputs the X-Database-Lagged: true header in these cases, so it'd be a matter of reading that in and informing $setOpts in the closure.

Yeah, interwikiTransclude() does something like that.

Change 579120 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] objectcache: make WANObjectCache::set() handle very slow regeneration

https://gerrit.wikimedia.org/r/579120

Krinkle renamed this task from WanObjectCache::getWithSetCallback seems not to set objects when fetching data is slow to WANObjectCache::getWithSetCallback seems not to set objects when fetching data is slow. Apr 14 2020, 10:06 PM

Change 579120 merged by jenkins-bot:
[mediawiki/core@master] objectcache: make WANObjectCache::set() handle very slow regeneration

https://gerrit.wikimedia.org/r/579120

Krinkle assigned this task to aaron.