Page MenuHomePhabricator

Implement DC-local cache failure limiter in Thumbor
Closed, ResolvedPublic


This is the logic to reproduce from thumb.php:

	global $wgAttemptFailureEpoch;

	$cache = ObjectCache::getLocalClusterInstance();
	$key = $cache->makeKey(
		md5( $thumbName )

	// Check if this file keeps failing to render
	if ( $cache->get( $key ) >= 4 ) {
		return [ false, wfMessage( 'thumbnail_image-failure-limit', 4 ) ];

	$done = false;
	// Record failures on PHP fatals in addition to caching exceptions
	register_shutdown_function( function () use ( $cache, &$done, $key ) {
		if ( !$done ) { // transform() gave a fatal
			// Randomize TTL to reduce stampedes
			$cache->incrWithInit( $key, $cache::TTL_HOUR + mt_rand( 0, 300 ) );
	} );

Event Timeline

Gilles renamed this task from Implement DC-local cache failure limiter to Implement DC-local cache failure limiter in Thumbor.Nov 23 2016, 12:14 PM

Now that I've looked into PoolCounter more, I think I can use it to reproduce this feature, using the following settings:

  • 5 workers.
  • Every time an error is encountered, the current thumbor instance acquires a lock for itself on the failing filename, and sets a timed callback to execute an hour later (+ fuzzing to avoid stampede). An hour later, that callback runs and releases the lock.
  • A very low timeout because we want to bail ASAP when all 5 workers are busy (i.e. this file failed 5+ times in the last hour). And for the same reason, no queue. In fact not having any queue probably guarantees erroring right away when the 5 workers are busy, making the timeout value kind of irrelevant.

Keeping the lock for a while should be safe because should a thumbor process who holds the lock die, the lock is released. Which means we might attempt to render a failing thumbnail a bit more than this throttling mechanism would normally allow, not a big deal.

Now it's an open question as to whether this "novel" use of PoolCounter might put some stress on it, keeping locks for so long. As I understand it, PoolCounter was designed to only hold the lock for as long as an actual job had to run, not for an artificially long time, acting as another kind of timeout. But we're not talking about a huge amount of locks here, I think it should be fine. We'll probably want to deploy it separately from the other uses of PoolCounter by Thumbor, though, to get a sense of the impact on PoolCounter.

I now think that poolcounter is unsuitable for this because processing individual requests can take several seconds and using PC for failure throttling would mean that requests are "guilty until proven innocent". This would fail in situation where bursts of legitimate requests happen and the "failure lock" wouldn't get released fast enough.

I think that inherently failure throttling can only be counted at the time the failure happens, without locking.

Change 342809 had a related patch set uploaded (by Gilles):
[operations/debs/python-thumbor-wikimedia] Upgrade to 0.1.36

Change 342811 had a related patch set uploaded (by Gilles):
[operations/puppet] Enable memcache-based Thumbor broken thumbnail throttling

Change 342809 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia] Upgrade to 0.1.36

Change 343262 had a related patch set uploaded (by Gilles):
[operations/puppet] Consolidate Performance team's root access

Change 343262 merged by Filippo Giunchedi:
[operations/puppet] Consolidate Performance team's root access

Change 355429 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 0.1.39

Change 355429 merged by Filippo Giunchedi:
[operations/debs/python-thumbor-wikimedia@master] Upgrade to 0.1.39

Mentioned in SAL (#wikimedia-operations) [2017-05-24T15:00:58Z] <godog> deploy thumbor 0.1.39 for memcache-based throttling - T151065