Fast stale ParserCache responses on PoolCounter contention
Closed, ResolvedPublic

Description

Problem

In news-driven events leading to simultaneous high edit rate and high view rate for a Wikipedia page, large numbers of PHP workers are found to be waiting in PoolCounter for parsing to complete.

Currently, when a page is edited, requests for that page will be queued until the new revision has been parsed and the new HTML is available in ParserCache. This queueing is done using a combination of ATS request coalescing and MediaWiki PoolCounter. If MediaWiki’s PoolCounter maximum queue size of 100 workers is exceeded, or if the timeout of 15 seconds expires, MediaWiki checks to see if there is a stale cache entry. If there is, it serves that instead, with zero s-maxage. If there is no stale cache entry, an error is shown. So the PoolCounter timeout and maxqueue cannot be reduced without risking error delivery.
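For reference, these limits come from the per-pool configuration. A minimal sketch of the "ArticleView" pool with the values mentioned above (the workers count here is an assumption, not necessarily the production setting):

$wgPoolCounterConf = [
	'ArticleView' => [
		'class' => 'PoolCounter_Client',
		'timeout' => 15,    // seconds a queued worker waits for the lock
		'workers' => 1,     // assumed; the production value may differ
		'maxqueue' => 100,  // beyond this, requests go straight to the stale/error path
	],
];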

Proposal

If MediaWiki receives a parser cache miss request for a page, and parsing is already underway (i.e. the PoolCounter lock acquisition would block), it should serve the stale page immediately. The s-maxage could be several seconds, instead of zero, so that ATS can serve stale data at a high request rate while parsing continues. MediaWiki should only wait for the PoolCounter lock if stale data is not available.

Implementation details

  • Add a timeout parameter to PoolCounter::acquireForAnyone(). If non-null, it overrides the default pool timeout.
    • Judging by locks.c line 129, the daemon already has a special case for timeout=0 and will synchronously respond with "TIMEOUT" if the worker slots are occupied.
    • The unused Redis implementation can easily be modified to override the timeout on a per-request basis, although it is probably not so efficient.
  • In PoolCounter::execute(), in the cacheable branch, initially call acquireForAnyone() with timeout=0 (see the sketch after this list).
    • If it fails (would have blocked), call fallback().
    • If fallback() succeeds, return the response.
    • If fallback() fails, call acquireForAnyone() with the default timeout and proceed as before.
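A rough sketch of that flow, reusing the existing PoolCounterWork method names (fallback(), acquireForAnyone()) and status constants; the exact signatures and error handling in the eventual patch may differ:

// Sketch only: cacheable branch of PoolCounterWork::execute() with the
// proposed fast-stale behaviour. Names follow the existing
// PoolCounter/PoolCounterWork API; details may differ from the merged change.
$status = $this->poolCounter->acquireForAnyone( 0 ); // non-blocking attempt
if ( $status->isOK() && $status->value === PoolCounter::TIMEOUT ) {
	// All worker slots are occupied, so parsing is already underway
	// elsewhere. Try to serve stale ParserCache output immediately.
	$staleResult = $this->fallback();
	if ( $staleResult !== false ) {
		return $staleResult;
	}
	// No stale entry available: fall back to a normal blocking
	// acquisition with the configured pool timeout and proceed as before.
	$status = $this->poolCounter->acquireForAnyone();
}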

Event Timeline

One thing that comes to mind is the code we have in MediaWiki.php for shortening the CDN TTL when the request used a known-lagged replica. This seems like something we'd want to trigger here as well.
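For reference, that mechanism is roughly the following call (from memory; the exact config key may be named differently):

// Cap the CDN TTL when the page was rendered from a known-lagged replica,
// so possibly outdated HTML doesn't linger in the caches.
$output->lowerCdnMaxage( $config->get( 'CdnMaxageLagged' ) );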

Tim mentioned this in conversation earlier but I think it's worth writing down: we'd have to ensure that the post-save page view isn't stale. It would be pretty confusing if editors didn't see their own edits.

Maybe you check for the cookie "UseCDNCache: false", LBFactory::hasOrMadeRecentChanges(), and ChronologyProtector::getTouched().

eprodromou added a subscriber: eprodromou.

This seems like a high priority to get right before we need it.

AMooney raised the priority of this task from High to Unbreak Now!. May 4 2020, 5:03 PM
AMooney lowered the priority of this task from Unbreak Now! to High. May 6 2020, 5:07 PM

Suppose the interval between edits is smaller than the parse time. Then the pool is always locked, and every ParserCache entry generated on page view is stale before it is even written to the cache. A reasonable objective would be: a user making an edit wants to immediately see the stale cache entry if the timestamp on the stale entry is later than or equal to the timestamp of their edit. Reusing UseCDNCache would not meet this objective and would conflate the parse time with $wgDataCenterUpdateStickTTL. Using ChronologyProtector::getTouched() apparently would meet it, and from the point of view of elegance it seems appropriate to reuse ChronologyProtector, since the goals are essentially the same.

ChronologyProtector::getTouched() returns the request shutdown time of the edit, which may be later than the time at which the ParserOutput was generated. So some fudging may be needed if we want to ensure that the user never waits when the ParserCache entry generated by their own save request is available.
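A minimal sketch of that comparison, assuming $touched holds the ChronologyProtector::getTouched() value for this user (a UNIX timestamp) and $parserOutput is the stale cache entry; the variable names and the fudge value are illustrative, not taken from any patch:

// Can the stale ParserCache entry be shown to the editor who just saved?
// getTouched() returns the request shutdown time of the edit, which may be
// later than the time the ParserOutput was generated, hence the fudge factor.
$fudgeSeconds = 10; // assumed value
$renderTime = (int)wfTimestamp( TS_UNIX, $parserOutput->getCacheTime() );
$canServeStale = ( $renderTime >= $touched - $fudgeSeconds );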

LBFactory::getChronologyProtectorTouched() requires the hostname of the master database server as the misleadingly named $dbName parameter. Its only caller in core and extensions, in FileAnnotations, gets this wrong, passing a domain ID instead. This is an unusual convention for LBFactory, so the simplest way to rectify this might be to just change the parameter to a domain ID, implicitly fixing the FileAnnotations feature.

Change 602284 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/core@master] [WIP] Fast stale ParserCache responses

https://gerrit.wikimedia.org/r/602284

Local test documentation:

I used PoolCounter 1.0.3 from a Debian package. It's old, but based on the changelog I think it is OK. MediaWiki configuration:

$wgPoolCounterConf = array(
	'ArticleView' => array(
		'class' => 'PoolCounter_Client',
		'timeout' => 10,
		'workers' => 1,
		'maxqueue' => 100,
		'fastStale' => true,
	),
);
$wgPoolCountClientConf['servers'] = [ '127.0.0.1' ];
$wgUseCdn = true;

Added the time and PID to the debug log:

$wgMWLoggerDefaultSpi = [
	'class' => 'MediaWiki\Logger\MonologSpi',
	'args' => [ [
		'handlers' => [
			'stream' => [
				'class' => 'Monolog\Handler\StreamHandler',
				'args' => [ $wgDebugLogFile ],
				'formatter'=> 'line',
			],
		],
		'formatters' => [
			'line' => [
				'class' => \Monolog\Formatter\LineFormatter::class,
				'args' => [ "%datetime% $pid [%channel%] %message%\n" ],
			],
		],
		'processors' => [
			'wiki' => [
				'class' => \MediaWiki\Logger\Monolog\WikiProcessor::class,
			],
		],
		'loggers' => [
			'@default' => [
				'processors' => [ 'wiki' ],
				'handlers' => [ 'stream' ],
			],
		],
	] ],
];
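The %datetime% $pid format string above interpolates a $pid variable that isn't defined in the snippet; presumably it was set earlier in LocalSettings.php, e.g.:

// Assumed setup for the line formatter above; not part of MediaWiki itself.
$pid = getmypid();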

Busy loop Lua module takes 3.7 seconds to parse:

return {
  busy = function () 
     for i = 1, 1000000000 do
     end
     return "Busy";
  end
}

Invalidate cache by updating page_touched:

update page set page_touched=utc_timestamp()+0 where page_id=14;
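The same page_touched bump can also be done from eval.php, which is what the testwiki test further down uses (assuming page_id 14 is the Busy page):

// Equivalent invalidation without raw SQL.
$title = Title::newFromText( 'Busy' );
$title->invalidateCache();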

eval.php client:

$multi = new MultiHttpClient([]);
$reqs = [];
for ( $i = 0; $i < 2; $i++ ) {
    $reqs[] = [ 'method' => 'GET', 'url' => 'http://mw.localhost/index.php/Busy'];
}
$responses = $multi->runMulti($reqs);
foreach ( $responses as $response ) {
    print_r($response['response']['headers']);
}

Results:

  • With fastStale=false the debug log shows one request doing the work while the other waits, as expected. CC:s-maxage is 18000 on both responses.
  • With fastStale=true the debug log shows one request doing the work while the other immediately returns a stale response, as expected. CC:s-maxage=10 is observed on the stale response.
  • Tested stale responses in the fastStale=false case by reducing the pool timeout to 1. CC:s-maxage=10 is observed on the stale response.
  • The Last-Modified header was initially incorrect, so I fixed it and confirmed the fix in both fast and slow modes. The Last-Modified header must be set to a time before page_touched, so that subsequent If-Modified-Since requests are given the updated output in a 200 response.

I also tried a test of ApiParse, but it calls WikiPage::getParserOutput(), which throws away staleness information. The stale response is probably returned as expected, but communicating the situation to the client would require refactoring.

Call site review:

  • Many things in core and extensions use PoolCounterWorkViaCallback, but only FlaggedRevs defines a fallback. It is similar to PoolWorkArticleView and may need chronology protection. It shares the "ArticleView" configuration with core and so the stale mode can't be separately configured.
  • The only subclasses of PoolCounterWork are PoolWorkArticleView and PoolCounterWorkViaCallback, nothing else defines a fallback by subclassing.

Change 602554 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[mediawiki/extensions/PoolCounter@master] Allow timeouts to be overridden on a per-request basis

https://gerrit.wikimedia.org/r/602554

Change 602284 merged by jenkins-bot:
[mediawiki/core@master] Fast stale ParserCache responses

https://gerrit.wikimedia.org/r/602284

Change 602554 merged by jenkins-bot:
[mediawiki/extensions/PoolCounter@master] Allow timeouts to be overridden on a per-request basis

https://gerrit.wikimedia.org/r/602554

Chronology protection is apparently unnecessary in FlaggedRevs. The affected function, parseStableRevisionPooled(), already cannot be used in the view-after-save case, since it would show the stable revision after you save an unstable edit. FlaggablePageView::injectPostEditURLParams() adds ?stable=0 to the view URL after save, which suppresses stable revision display by causing isDefaultPageView() to return false.

Mentioned in SAL (#wikimedia-operations) [2020-06-17T06:15:47Z] <tstarling@deploy1001> Synchronized wmf-config/PoolCounterSettings.php: test fast stale mode on testwiki T250248 (duration: 01m 17s)

I tested this on testwiki. Varnish coalescing prevents the stale case from being hit. So I'm effectively unable to reproduce the original report about threads waiting in PoolCounter.

My method is basically the same as the local test above, except that I also use eval.php to do a CDN purge:

$title = Title::newFromText('Test page long');
$url = $title->getFullUrl('', false, PROTO_HTTPS );
$multi = new MultiHttpClient([]);
$reqs = [];
for ( $i = 0; $i < 3; $i++ ) {
    $reqs[] = [ 'method' => 'GET', 'url' => $url ];
}
$title->invalidateCache();
$title->purgeSquid();
$responses = $multi->runMulti($reqs);
foreach ( $responses as $response ) {
    print_r($response['response']['headers']);
}

The result is that a single request is seen in mwlog1001:/srv/mw-log/testwiki.log. Response headers show the request always being forwarded to cp1077, so request coalescing is effectively global. It's possible to defeat coalescing by varying the URL (see the sketch below), in which case the expected behaviour is seen, but I haven't figured out how to do it in a way that might be hit in production.
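A minimal illustration of the URL-varying trick with the same MultiHttpClient setup; the cachebust parameter name is made up and means nothing to MediaWiki or the CDN:

// Append a throwaway query parameter so each request hashes to a different
// cache key and is not coalesced by Varnish.
$reqs = [];
for ( $i = 0; $i < 3; $i++ ) {
    $reqs[] = [
        'method' => 'GET',
        'url' => $url . '?cachebust=' . wfRandomString( 8 )
    ];
}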

It took me a while to figure out that the issue is in Varnish, not ATS. Once I figured this out, I used varnishlog as a sanity check.

I tried sending Cookie headers containing my own login cookies. The requests are still coalesced: one request initially does the work while the other two wait.

It may be that the original report related to a pool type that does not include anything URL-related in the pool key, like search.

The only way to defeat coalescing is to come from different source IPs.

What happens to a request coming from a client is:

  1. The request lands on the geolocalized LVS IP.
  2. Since we want clients to preserve their TLS sessions, we use source hashing as the load-balancing algorithm, meaning that a given IP will generally always reach the same real server.
  3. On the real server, both TLS termination and frontend caching (Varnish) happen.
  4. For an uncacheable request or a cache miss, the request will be sent to one specific ATS backend (via URL hashing).
  5. If the request is uncacheable, it will be passed on to the application backend.

So to change which Varnish server responds to you, you just need to change the origin IP (which was happening in the original report, IIRC).

@tstarling any updated estimated completion date?

I was able to confirm the expected behaviour using a script that waits until a specified time before beginning a request.

<?php

require __DIR__ . '/Maintenance.php';

class T250248_Inject extends Maintenance {
	public function execute() {
		$startTime = strtotime( $this->getArg() );
		if ( $startTime < time() ) {
			print "Cannot start at the specified time\n";
			return 1;
		}

		$title = Title::newFromText( 'Test page long' );
		$url = $title->getFullUrl( '', false, PROTO_HTTPS );
		$multi = new MultiHttpClient( [] );
		$reqs[] = [
			'method' => 'GET',
			'url' => $url,
			'headers' => [
				'Cookie' => 'fooSession=' . wfRandomString( 8 )
			]
		];
		$title->invalidateCache();
		if ( false ) {
			$title->purgeSquid();
		}
		time_sleep_until( $startTime );
		$responses = $multi->runMulti( $reqs );
		print "Elapsed time: " . ( microtime( true ) - $startTime ) . "\n";
		foreach ( $responses as $response ) {
			print_r( $response['response']['headers'] );
		}
	}
}

$maintClass = 'T250248_Inject';
require __DIR__ . '/doMaintenance.php';

By running this maintenance script on multiple servers (mw1286, mw1287, mw1288) with the same start timestamp, I was able to confirm that fast stale mode is activated on one of the three requests. The pool size is 2, so the first 2 requests acquire a lock and start parsing, while the third returns a stale response.

Instead of purging, I used a random cookie header, on the basis of my reading of Varnish's HSH_Lookup(), which suggests that coalescing will be skipped if there is an existing cache object with a Vary header and the new request is a Vary mismatch. But it wasn't necessary in the end, since my three requests were served by different servers.

Using varnishlog I confirmed that CC:s-maxage=10 is received, although I didn't confirm a cache hit on such an object.

Change 618646 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/mediawiki-config@master] Enable fastStale mode on all wikis

https://gerrit.wikimedia.org/r/618646

Change 618646 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable fastStale mode on all wikis

https://gerrit.wikimedia.org/r/618646

Mentioned in SAL (#wikimedia-operations) [2020-08-11T01:59:52Z] <tstarling@deploy1001> Synchronized wmf-config/PoolCounterSettings.php: enabling fast stale mode T250248 (duration: 00m 58s)