Page MenuHomePhabricator

Thumbnail urls should be versioned and sent with Expires headers
Open, Stalled, MediumPublic

Description

Author: sergey.chernyshev

Description:
Image URL with timestamp patch

I'm optimizing performance of MediaWiki instances and one of the issues I came across is that images in MediaWiki don't change their URLs over time as they change so it's impossible to set far future expires headers for them to keep them firmly in browsers caches.

Here's the link explaining this particular issue: http://developer.yahoo.com/performance/rules.html#expires

You can see this issue in action here: http://performance.webpagetest.org:8080/result/090218_132826127ab7f254499631e3e688b24b/ (simple two-run test of http://en.wikipedia.org/wiki/Hilary_Clinton page) - notice that on repeat run all image requests are sent again even though images didn't change so we get 55 extra requests with 304 responses which requires 4 more connections to the commons server (see "Connection View" section below) all of which could be avoided. This might get even worse if we'll test consequent views with pages sharing only some images - in this case loading images after the ones that were already requested will be blocked.

I didn't try to calculate traffic savings (can be significant even though it's only headers that are being sent), but it can be done based on some statistics.

The good news is that MediaWiki already has control over the versioning of uploaded files (which is most important for images) so the solution would be to just make unique query string for each version of the image.

It looks like solutions for local file store and remote stores might be different, but I created a patch that relies on getTimestamp to be implemented accordingly in each subclass (LocalFile.php / ForeignAPIFile.php and so on).

Another, much "cleaner", approach would be to use file revision number instead of timestamp, but it'll require more knowledge of file store implementation which I lack. It might be heavier on CPU though as it'll require getting history from the database.

Anyway, I'm attaching a patch that already works for local file repository where timestamp implementation works fine.

You can see result of this patch here: http://performance.webpagetest.org:8080/result/090219_289bbf4e150b039459abe3ba3d3ce148/ (notice, that on second run only the page is requested).

If it all sounds right, I can apply this patch to the tree.

Sergey

URL: http://performance.webpagetest.org:8080/result/090218_132826127ab7f254499631e3e688b24b/
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=44310

Attached:

Details

Reference
bz17577

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 10:33 PM
bzimport set Reference to bz17577.
bzimport added a subscriber: Unknown Object (MLST).

sergey.chernyshev wrote:

Yep, patch doesn't include web server configuration for expiration headers.
Simple .htaccess like can be put into images/ folder (if Apache has AllowOverride Indexes for it):

ExpiresActive on
ExpiresDefault A25920000

Spiffy!

Offhand looks good, though would want to double-check there's no conflicts with remote repos and the on-demand thumbnailing.

Tim, can you take a peek at this today and see if there's any issues there? Thanks!

ayg wrote:

Does Squid currently get purged on image reupload? I suppose it must, to deal with image links that have no size specified. I had always assumed Squid is why we didn't change image URLs, but on reflection, it seems unlikely to be a big deal to do such purges occasionally.

If this works with Squid and file cache, it should probably be on by default. (Why doesn't file cache hook into Squid's purge mechanism?)

Will squids purge File:Foo.jpg?timestamp=19700101000000 entry when Foo.jpg is reuploaded?

Are pages using images on remote-repos correctly purged on image reupload?
(I think the problems of bug 1394 complicate it)
Infinite expiry images plus squids serving pages pointing to old images...

Does Squid currently get purged on image reupload?

Currently the plain page view URL does get purged from local Squids, however any *client* that has cached the image doesn't get any such notification. So, either the browser has to go back to hit the server every time it shows it to check if it's changed (slow!), or it speculatively caches it for some amount of time with the risk of showing an outdated version.

You can see this effect when you upload a new version of an image and see the old one sitting there on the File: page until you refresh.

Changing the URL with a timestamp would mean that any page which has been updated will use the updated URL, giving you the updated image version when you view it.

Are pages using images on remote-repos correctly purged on image reupload?

Nope, which is an issue to consider. There's not currently any registry of remote use, so the wiki doesn't know who to send purges to. (This would not be too hard to implement internally for DB-based repos so Commons could update the other Wikimedia sites, but would be much trickier for third-party sites using us via an API repo).

demon added a comment.Feb 20 2009, 5:13 PM

(In reply to comment #5)

Are pages using images on remote-repos correctly purged on image reupload?

Nope, which is an issue to consider. There's not currently any registry of
remote use, so the wiki doesn't know who to send purges to. (This would not be
too hard to implement internally for DB-based repos so Commons could update the
other Wikimedia sites, but would be much trickier for third-party sites using
us via an API repo).

Which unless we had a dedicated action=repo or similar, the API has no way of distinguishing between normal API requests and a request to act as a repo.

sergey.chernyshev wrote:

So what do we do with this? Can this patch be localized so that stores that can benefit from this could utilize this feature?

sergey.chernyshev wrote:

Not related to the solution but useful to measure future performance optimizations:
http://www.showslow.com/details/?url=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FHillary_Clinton

sergey.chernyshev wrote:

Just to illustrate results, here's the test for TechPresentations with this patch applied (it uses local repository): http://www.webpagetest.org/result/091210_3H74/

sergey.chernyshev wrote:

It's been 2 years since I provided initial patch, but Hillary Clinton still sends 304s for static assets: http://www.webpagetest.org/result/110225_EY_5e420956c8cf54450c47902cc4e82be0/1/details/cached/

You're loosing user experience and traffic.

ayg wrote:

This needs to be reviewed by someone who understands our Squid setup, like Tim or Brion. I don't think it needs a config option, it should just always be enabled, but we need to make sure the right Squid URLs are purged for it to work on Wikimedia. You're right that the status quo is unreasonable.

sergey.chernyshev wrote:

I think last time this was discussed there was another issue - that you guys have remote repository with static assets (uploads.wikimedia.org) while smaller MW installs can use local system to determine the version number.

In any case, it's worth implementing in both cases.

Sergey

No, it's not a problem for wikimedia since it is -for now- nfs mounted.

It is a problem for people using us as a remote repository.

sergey.chernyshev wrote:

Basically, there are a few ways to get versions:

  • from asset itself
    • ideally crc32/md5 of content (it's actually pretty fast)
    • or modification time (which is not very good)
  • from meta-data (in case of MW, it's file revision number)

Ideally, it should be part of the file name, e.g.
http://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Hillary_Rodham_Clinton_Signature.svg/rev125/128px-Hillary_Rodham_Clinton_Signature.svg.png

Notice "rev125" between last two slashes. It can be a real folder if you prefer to keep old files or just pseudo-folder which can only be used for cache busting.

Cache busting is a must as we have assets infinitely stored in all possible caches, not only in SQuid. Don't know if you need to tell SQuid to clean old URLs or they'll be just LRU-ed later.

BTW, all this goes for skin files as well - it should probably be done differently though - as build script or post-commit hook or something like SVN-Assets tool that checks repository revision or hash of the file and generates file names accordingly.

Sergey

sumanah wrote:

Adding the need-review keyword to indicate that this patch still needs to be reviewed. Thanks for the patch and sorry for the wait, Sergey.

sumanah wrote:

Sergey, I'm sorry, but because so much time has passed since you submitted your patch, trunk has changed and your patch no longer applies cleanly. If the problem's still happening, would you mind updating it and then letting me know? I'll then get a reviewer for it. Thanks.

It doesn't make any difference to me whether or not the patch is updated. There's not a significant amount of code review here, it's mostly about the idea.

sergey.chernyshev wrote:

Glad you guys are on it - I don't think I can dig into the guts of MW again to get it working right, but Tim is correct, it's not much code, just simple stuff.

Still, if you can use real or pseudo-folders for file names, that would be even better (query strings might not be as good in terms of caches like your Squids and external caches too).

BTW, old way and new way can co-exists if there are worries about some instances not being able to support remote repos - all you need to do is set up infinite expires only on versioned URLs and keep regular URLs intact.

(In reply to comment #19)

It's been 3 years already, but Hillary is still very slow:
http://www.webpagetest.org/result/120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9/
Second run is not significantly more efficient then firs one:
http://www.webpagetest.org/video/compare.php?tests=120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:0,120302_AS_8bd3281d5ad4e661f4a2ad91d0a006b9-r:1-c:1

Well second run being almost same speed is mostly due to counting ajax from the banner load. If you discount the ajax, it'd be roughly 6.2 seconds vs 5.5 seconds. If you don't count one image that took insanely long to return a 304 (which could just be a rare occurrence. Or it could be common place, I don't really know), the comparision becomes 6.2 seconds vs 3.6 seconds. Hence the speedup by fixing this bug might not be as much as that test would lead you to believe (It is still probably something that is fairy significant though, assuming it can be done effectively)

sergey.chernyshev wrote:

Actually, I'm looking at render times and not on load events.

Adding performance keyword, and removing Tim since he's not specifically looking at this. Aaron Schulz may have an idea or two about where we should go with this.

TheDJ added a comment.Apr 25 2013, 9:02 AM

We really should look at this one again. If the WMF infra is so problematic, then perhaps we should wrap it in a conditional, so that at least it will improve functionality for 'the rest of them' ?

It could potentially fix the problem where people upload a new version of an image and some browsers don't purge the cached copy of the thumbnail by themselves. (We still need to tell some people to bypass their browser cache after 'upload new version' at times, even though I can't really see why a browser would 'not' send a request with the current setup. Perhaps some browsers try to be too smart if there is no Cache-Control:must-revalidate and set an hidden max-age ?).

sergey.chernyshev wrote:

Checking back 4.5 years later, are you guys still interested in saving traffic and increasing performance of web pages?

Any way I can help with his? Refreshing everybody's memory? Explaining the effect this can have on users and systems?

I'll be happy to do so - can even take a day or two of vacation to help.

(In reply to comment #24)

Checking back 4.5 years later, are you guys still interested in saving
traffic
and increasing performance of web pages?
Any way I can help with his? Refreshing everybody's memory? Explaining the
effect this can have on users and systems?
I'll be happy to do so - can even take a day or two of vacation to help.

Actually there has been recent interest in this sort of thing, but for different reasons (easier management of purging cache on server side. Obviously your reasons are good too)

sergey.chernyshev wrote:

Great, I'll be happy to see this implemented.

Krinkle renamed this task from Image urls should have far future expires to Thumbnail urls should be versioned and sent with Expires headers.Jun 2 2015, 11:28 PM
Krinkle removed a project: MediaWiki-Core-Team.
Krinkle set Security to None.
Krinkle removed a subscriber: Unknown Object (MLST).
Restricted Application added a subscriber: Matanya. · View Herald TranscriptJul 30 2015, 12:34 AM
Peter added a subscriber: Peter.Aug 19 2015, 8:22 PM
Restricted Application added a subscriber: Steinsplitter. · View Herald TranscriptAug 19 2015, 8:22 PM
Krinkle raised the priority of this task from Low to Medium.Sep 4 2015, 2:47 AM
Krinkle updated the task description. (Show Details)
Jdforrester-WMF moved this task from Untriaged to Backlog on the Multimedia board.Sep 4 2015, 6:41 PM
Restricted Application added a project: Commons. · View Herald TranscriptNov 12 2015, 1:05 PM
zhuyifei1999 moved this task from Incoming to Backlog on the Commons board.Nov 13 2015, 10:13 AM
Meno25 removed a subscriber: Meno25.Feb 22 2016, 5:58 PM
Restricted Application added a subscriber: Poyekhali. · View Herald TranscriptDec 6 2016, 12:41 AM

@sergey.chernyshev Thank you for trying to drive this forward for so many years. I run MediaWiki for a side project (sarna.net) and not having versioned URLs for images has always bothered me. Just happened to stumble upon this issue finally! Versioned image URLs will let me offload the images to a CDN better (with a 2+year expiry).

Your patch more-or-less works with the current source. Here's a patched based off MW 1.28:

For anyone else following along at home, there are a few other issues I've found tracking something similar:

Unfortunately the first change doesn't affect all image URLs, as it only adds the SHA1 to thumbnail URLs. My site has a lot of pages where the images are small and are included directly, so I've actually turned on both supportsSha1URLs and this patch for &timestamp= so both image and thumbnail URLs are versioned. I'm currently testing this to make sure it will work when content changes.

It's also quite possible that the SHA1 could be added to regular image URLs (and not need &timestamp=), but I haven't tested that.

For what it's worth, I have a single-server setup that relies heavily on a CDN to offload assets and improve performance:

  • HTML pages are served from the server
  • JS and CSS via ResourceLoader (load.php) are served from the CDN. I've also hacked in a siterev=$wgCacheEpoch to all load.php URLs so they can all have a 2 year expiry (and I manually clear the file cache on any MW/skin/extension/JS/CSS changes)
  • Images and Thumbnails can now be served from the CDN once they have versioned URLs

So that is why versioned URLs really help me -- I can rely on the CDN to do the heavy lifting of JS, CSS and Images.

Anyways, hope this helps someone else out. There is a lot of good discussion in T66214/T149847, though the design is still being worked in. In the meantime, the above two changes are currently working (being tested) for my needs.

Imarlier changed the task status from Open to Stalled.Jun 20 2018, 9:08 AM

I use a custom-made extension that adds the file timestamp to the URLs (thumbnails and original file). If anyone is interested, the code is here and can be seen on wikidex.net

gpaumier removed a subscriber: gpaumier.Jun 20 2018, 7:55 PM

I use a custom-made extension that adds the file timestamp to the URLs (thumbnails and original file). If anyone is interested, the code is here and can be seen on wikidex.net

Ciencia, I'm extremely interested in trying your WikiDexFileRepository extension, but there isn't any installation and server configuration instructions. If you're available to help me get this implemented, I make a pull request for you that contains the instructions that worked for me. Got time to help?

...

I've updated the README.md of the repository with sample configuration