Page MenuHomePhabricator

Implement DC-local cache failure limiter in Thumbor
Closed, ResolvedPublic

Description

This is the logic to reproduce from thumb.php:

	global $wgAttemptFailureEpoch;

	$cache = ObjectCache::getLocalClusterInstance();
	$key = $cache->makeKey(
		'attempt-failures',
		$wgAttemptFailureEpoch,
		$file->getRepo()->getName(),
		$file->getSha1(),
		md5( $thumbName )
	);

	// Check if this file keeps failing to render
	if ( $cache->get( $key ) >= 4 ) {
		return [ false, wfMessage( 'thumbnail_image-failure-limit', 4 ) ];
	}

	$done = false;
	// Record failures on PHP fatals in addition to caching exceptions
	register_shutdown_function( function () use ( $cache, &$done, $key ) {
		if ( !$done ) { // transform() gave a fatal
			// Randomize TTL to reduce stampedes
			$cache->incrWithInit( $key, $cache::TTL_HOUR + mt_rand( 0, 300 ) );
		}
	} );

Details

Related Gerrit Patches:
operations/debs/python-thumbor-wikimedia : masterUpgrade to 0.1.39
operations/puppet : productionConsolidate Performance team's root access
operations/debs/python-thumbor-wikimedia : masterUpgrade to 0.1.36

Event Timeline

Gilles created this task.Nov 18 2016, 4:41 PM
Gilles renamed this task from Implement DC-local cache failure limiter to Implement DC-local cache failure limiter in Thumbor.Nov 23 2016, 12:14 PM
elukey removed a subscriber: elukey.Nov 24 2016, 11:32 AM

Now that I've looked into PoolCounter more, I think I can use it to reproduce this feature, using the following settings:

  • 5 workers.
  • Every time an error is encountered, the current thumbor instance acquires a lock for itself on the failing filename, and sets a timed callback to execute an hour later (+ fuzzing to avoid stampede). An hour later, that callback runs and releases the lock.
  • A very low timeout because we want to bail ASAP when all 5 workers are busy (i.e. this file failed 5+ times in the last hour). And for the same reason, no queue. In fact not having any queue probably guarantees erroring right away when the 5 workers are busy, making the timeout value kind of irrelevant.

Keeping the lock for a while should be safe because should a thumbor process who holds the lock die, the lock is released. Which means we might attempt to render a failing thumbnail a bit more than this throttling mechanism would normally allow, not a big deal.

Now it's an open question as to whether this "novel" use of PoolCounter might put some stress on it, keeping locks for so long. As I understand it, PoolCounter was designed to only hold the lock for as long as an actual job had to run, not for an artificially long time, acting as another kind of timeout. But we're not talking about a huge amount of locks here, I think it should be fine. We'll probably want to deploy it separately from the other uses of PoolCounter by Thumbor, though, to get a sense of the impact on PoolCounter.

Gilles added a comment.Jan 2 2017, 3:25 PM

I now think that poolcounter is unsuitable for this because processing individual requests can take several seconds and using PC for failure throttling would mean that requests are "guilty until proven innocent". This would fail in situation where bursts of legitimate requests happen and the "failure lock" wouldn't get released fast enough.

I think that inherently failure throttling can only be counted at the time the failure happens, without locking.

Gilles updated the task description. (Show Details)Mar 7 2017, 9:24 AM

Change 342809 had a related patch set uploaded (by Gilles):
[operations/debs/python-thumbor-wikimedia] Upgrade to 0.1.36

https://gerrit.wikimedia.org/r/342809

Change 342811 had a related patch set uploaded (by Gilles):
[operations/puppet] Enable memcache-based Thumbor broken thumbnail throttling

https://gerrit.wikimedia.org/r/342811

Change 342809 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia] Upgrade to 0.1.36

https://gerrit.wikimedia.org/r/342809

Change 343262 had a related patch set uploaded (by Gilles):
[operations/puppet] Consolidate Performance team's root access

https://gerrit.wikimedia.org/r/343262

Change 343262 merged by Filippo Giunchedi:
[operations/puppet] Consolidate Performance team's root access

https://gerrit.wikimedia.org/r/343262

Krinkle removed a subscriber: Krinkle.Apr 26 2017, 10:39 PM

Change 355429 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 0.1.39

https://gerrit.wikimedia.org/r/355429

Change 355429 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 0.1.39

https://gerrit.wikimedia.org/r/355429

Mentioned in SAL (#wikimedia-operations) [2017-05-24T15:00:58Z] <godog> deploy thumbor 0.1.39 for memcache-based throttling - T151065

fgiunchedi moved this task from Backlog to Radar on the User-fgiunchedi board.May 26 2017, 2:50 PM
Gilles closed this task as Resolved.May 29 2017, 12:47 PM