Remote file thumb generation blocks serving a page
Closed, Resolved, Public

Description

On wiki.wikimedia.it, I have recently observed a very severe performance degradation on all pages that embed images from Wikimedia Commons: time to first byte is typically over 9 seconds, while it is under 1 second on any page that doesn't include remote images. http://www.webpagetest.org/result/150929_GX_HFJ/1/details/

The local configuration for InstantCommons/remote files can probably be improved, but however bad a local configuration is, this should not happen. Can't MediaWiki just serve the HTML regardless of the status of the thumbnails?

Event Timeline

Nemo_bis set the priority of this task to Needs Triage.
Nemo_bis updated the task description.
Nemo_bis added subscribers: Nemo_bis, Isarra.
Restricted Application added subscribers: Steinsplitter, Aklapper.
Jdforrester-WMF set Security to None.

The output of the parser varies depending on:

  • The aspect ratio of the thumbnail
  • The existence of the thumbnail (whether it's shown as a link)

Additionally, the current code path (if memory serves) stops parsing every time it encounters a thumbnail, makes an API request to Commons, and downloads the thumb before continuing.
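
To make the cost of that sequential path concrete, here is a rough sketch of the arithmetic. The per-thumbnail latencies are made-up illustrative numbers, not measurements from this task, and the function names are hypothetical. It only shows that waiting for each thumbnail in turn costs the sum of the latencies, whereas overlapping the fetches would cost at best the slowest one.

```php
<?php
// Hypothetical per-thumbnail latencies in seconds (API request + thumb download).
// Illustrative values only, not measured on any real wiki.
$latencies = [1.2, 0.8, 1.5, 0.9];

/** Total wait when each thumbnail is fetched before parsing continues. */
function sequentialWait(array $latencies): float {
	return (float)array_sum($latencies);
}

/** Best-case wait if all fetches could be issued concurrently. */
function parallelWait(array $latencies): float {
	return $latencies ? (float)max($latencies) : 0.0;
}

printf(
	"sequential: %.1f s, parallel (best case): %.1f s\n",
	sequentialWait($latencies),
	parallelWait($latencies)
);
```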

It would be more efficient if:

  • Getting the thumbnail details could be done in another pass, so it could be done in parallel. [would be quite a big change to parser]
  • One could use 404 rendering with commons so that it doesn't block on downloading the (potentially large) image from commons. [probably an easier change]

Does instantcommons support 404 rendering at all currently?
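
For context, relying on 404 rendering means the local wiki only has to emit a thumbnail URL and let Commons generate the file on first request. A minimal sketch of how such a URL could be derived locally, assuming a plain raster file (JPEG/PNG), a file name already in its canonical form, and the usual md5-based two-level hashed layout; `commonsThumbUrl()` is a hypothetical helper, not a MediaWiki function, and other file types (SVG, multi-page formats, very long names) use different thumbnail names:

```php
<?php
/**
 * Build a hotlink URL for a Commons thumbnail without downloading anything.
 * Hypothetical helper; assumes a plain raster file and a two-level md5 hash path.
 */
function commonsThumbUrl(string $fileName, int $width): string {
	$base = 'https://upload.wikimedia.org/wikipedia/commons/thumb';
	// File names are stored with underscores instead of spaces.
	$name = str_replace(' ', '_', $fileName);
	$hash = md5($name);
	// Two hash levels: first hex digit, then first two hex digits.
	$dir = $hash[0] . '/' . substr($hash, 0, 2);
	$encoded = rawurlencode($name);
	return "$base/$dir/$encoded/{$width}px-$encoded";
}

echo commonsThumbUrl('Example.jpg', 220), "\n";
```

Nothing is fetched at parse time here; the browser requests the URL later, which is why this approach depends on the local and remote wikis agreeing on the thumbnail naming scheme.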

thumb.php/thumb_handler.php doesn't like foreign repos, which is the major blocker for using 404-handlers with instant commons. (T27958)

If local thumb cache is disabled, MediaWiki might use the foreign 404 handler (i.e. link directly to Commons, don't download the image). I'm not sure.

This configuration is equivalent to setting $wgGenerateThumbnailOnParse = false for the remote files and successfully hotlinks the thumbnail. If I manage to get it applied by WMIT's sysadmin, I'll let you know about the real-world speedup.

$wgForeignFileRepos[] = array(
	'class' => 'ForeignAPIRepo',
	'name' => 'commonshotlink',
	'apibase' => 'https://commons.wikimedia.org/w/api.php',
	'hashLevels' => 2,
	'url' => 'https://upload.wikimedia.org/wikipedia/commons', # <-- Changed from InstantCommons default
	'thumbUrl' => 'https://upload.wikimedia.org/wikipedia/commons/thumb', # <-- redundant, just for clarity
	'transformVia404' => true, # <-- Changed from InstantCommons default
	'fetchDescription' => true,
	'descriptionCacheExpiry' => 43200,
	# If the cache actually works, maybe 1 month is better for small wikis.
	# If the cache is broken, 0 is better because getThumbUrlFromCache() will
	# just return the thumb URL without trying to download anything from Commons.
	'apiThumbCacheExpiry' => 24 * 3600,
);

IMHO InstantCommons should hotlink by default. Alternatively, it should respect $wgGenerateThumbnailOnParse (currently it forces it to true) or provide another configuration setting to achieve the same result.

IIRC enabling transformVia404 will break thumbnailing due to T27958 (although it's been two years since I looked at that so who knows). That task has a patch attached which is (also IIRC) complete although probably needs lots of rebasing now.

Yeah the thumb.php we have still assumes the files are in the LocalRepo afaik.

IIRC enabling transformVia404 will break thumbnailing due to T27958

Do you mean thumbnailing of local files? I didn't test that, but of course thumbnailing remote files is not needed if you hotlink them (it would rather be a bug if they were generated/downloaded).

Ah, ok, I missed the point of that code block. That should work, although it's fragile if you don't run the same MediaWiki version as Commons (e.g. T112546 will break it).

That's an inherent fragility we already have: for instance, you can't get thumbs of remote files for which you don't have a handler ([[File:Mozart Sonate (manuscript).djvu|thumb]] is rendered as a link only).

Why is that a problem? If people actually care about these formats on their local sites it's usually pretty trivial to set up the handlers, and even if not users can still go upstream for actual previews and crap.

Sure, it's what I was trying to say. :) It doesn't matter if some more incompatible thumbs appear.
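
For the DjVu example mentioned above, setting up the local handler is roughly a matter of pointing MediaWiki at the DjVuLibre tools. A sketch of a LocalSettings.php fragment; the binary paths are assumptions and depend on where the tools are installed on the local system:

```php
# LocalSettings.php fragment. Paths are assumptions; adjust to your system.
$wgDjvuDump = '/usr/bin/djvudump';
$wgDjvuRenderer = '/usr/bin/ddjvu';
$wgDjvuTxt = '/usr/bin/djvutxt';
# Convert the renderer's PNM output to JPEG thumbnails.
$wgDjvuPostProcessor = 'pnmtojpeg';
$wgDjvuOutputExtension = 'jpg';
```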

Except that example is just some filetypes. Wouldn't this potentially apply to all files? That seems much more significant.

It would also break all cached pages in Wikipedia, and break all hotlinked images on commons (hotlinking is somewhat encouraged by commons policy).

It will be interesting to watch that change be rolled out...

Obviously there would be b/c support for filenames if such a change gets rolled out. (Also I think filename generation doesn't really come into play if a thumbnail file by that name exists on Swift?) Which also makes it a bad example - InstantCommons would only break when the local wiki is updated before the remote wiki, and with Commons that's unlikely to happen.

I don't see any real disadvantage to hotlinking like this, then.

Ok, I managed to reduce time to first byte from ~8400 to ~300 ms on a page with several images from Commons: http://www.webpagetest.org/video/compare.php?tests=151010_2A_JEE,150929_GX_HFJ

The configuration I ended up using is:

# $wgUseInstantCommons  = true;
$wgGenerateThumbnailOnParse = true;
$wgForeignFileRepos[] = array(
	'class' => 'ForeignAPIRepo',
	'name' => 'commonshotlink',
	'apibase' => 'https://commons.wikimedia.org/w/api.php',
	'hashLevels' => 2,
	'url' => 'https://upload.wikimedia.org/wikipedia/commons',
	'thumbUrl' => 'https://upload.wikimedia.org/wikipedia/commons/thumb',
	'transformVia404' => true,
	'fetchDescription' => true,
	'descriptionCacheExpiry' => 24 * 3600,
	'apiThumbCacheExpiry' => 24 * 3600,
);

The biggest gain was actually $wgMainCacheType = CACHE_ANYTHING; (and $wgSessionsInObjectCache = true; for local files!!! kudos @aaron) but I also observed that:

I think I'll send a patch later to make the above config default.

Change 251556 had a related patch set uploaded (by Nemo bis):
Hotlink InstantCommons images by default to speed up parsing

https://gerrit.wikimedia.org/r/251556

Change 251556 merged by jenkins-bot:
Hotlink InstantCommons images by default to speed up parsing

https://gerrit.wikimedia.org/r/251556

The URLs were changed to point to Commons, but transformVia404 has no effect since it is implemented in the parent File::transform() which is not called. Also, there is a bug in the cache expiry code, meaning the thumbnail is downloaded every time instead of once per month. So the full thumbnail is downloaded and stored every time a page containing a commons image is rendered.
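
For illustration only, the intended behaviour ("download once per expiry period, then reuse") amounts to a freshness check along these lines. This is a hypothetical helper, not MediaWiki's actual code, and it does not reproduce the specific bug; `$ttl` plays the role that 'apiThumbCacheExpiry' has in the configuration above.

```php
<?php
/**
 * Decide whether a locally cached thumbnail may be reused.
 * $fetchedAt and $now are Unix timestamps; $ttl is the expiry in seconds.
 * Hypothetical sketch, not taken from MediaWiki.
 */
function isThumbCacheFresh(int $fetchedAt, int $now, int $ttl): bool {
	if ($ttl <= 0) {
		// Caching disabled: never reuse a stored copy.
		return false;
	}
	return ($now - $fetchedAt) < $ttl;
}

$month = 30 * 24 * 3600;
// A thumbnail fetched an hour ago, checked against a one-month expiry.
var_dump(isThumbCacheFresh(time() - 3600, time(), $month));
```

If a check like this always came out false, every page render would trigger a fresh download, which matches the symptom described.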