Page MenuHomePhabricator

ResourceLoader breaks Save Page functionality in browsers
Open, MediumPublic

Description

If you open a Wikipedia page in Chrome, and choose Save as... / Web Page, Complete from the right-click menu, then try to open to saved page, it will be barely readable (almost all CSS will break). This is due to ResourceLoader serving all CSS with a load.php filename - Chrome will save it as such, and then presumably use the file extension to guess content-type, and thus fail to interpret it as CSS. Given our focus on offline usage, breaking the most intuitive way of saving a page for offline usage in the most popular browser seems like a bad thing, and it seems reasonably easy to fix - just use URLs with the appropriate extension, ie. load.php/foo.css and load.php/bar.js (with a feature flag to disable in case the server does not support PATH_INFO).

(Via StackOverflow.)

Event Timeline

Tgr created this task.Jul 6 2019, 10:03 PM
Restricted Application added a project: Performance-Team. · View Herald TranscriptJul 6 2019, 10:03 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
kchapman assigned this task to Krinkle.Jul 8 2019, 7:49 PM
kchapman moved this task from Inbox to Doing on the Performance-Team board.
Krinkle triaged this task as Medium priority.Jul 8 2019, 7:50 PM

As far as I know, this used to work and we haven't changed anything in this regard.

Browsers tend to create a directory when saving a page, in which each known subresource is given a unique and flat file name, with the HTML file rewritten to match those names. These new names are different because url schemes and query parameters tend to not be supported in file names.

The browser would e.g. turns load.php?startup and load.php?site into load.php_1 and load.php_2. I don't recall whether browsers also append a suffix (based on the content-type header), or whether they just discard that information and instead allow any file type to be used at run-time. The latter is how it used to be for <script>, and this is still the case for other resources. E.g. <link rel=stylesheet href=foo.txt> and <img src=foo.txt> will work fine if they contain stylesheets and image binary.

As far as I know, adding a suffix here shouldn't make a difference, unless there is a server in-between that would use that suffix to synthesise a Content-Type header. While Nginx and Apache would do that, afaik the file-server that browsers use doesn't (except for the navigation URL's file, to decide how to render it).

What has changed, is that Chrome (and other browsers) no longer allow execution of JavaScript code that was served as text/plain. Whereas they do still allow this for stylesheets, images, and video.

In trying this out locally, I can confirm this:

Wikipedia_files
load.php
load(1).php
load(2).php
load(3).php
31px-Commons-logo.svg.png
35px-Mediawiki-logo.png
35px-Wikibooks-logo.svg.png
35px-Wikiquote-logo.svg.png
35px-Wikisource-logo.svg.png
35px-Wikispecies-logo.svg.png
35px-Wikivoyage-Logo-v3-icon.svg.png
35px-Wiktionary-logo-v2.svg.png
41px-Wikiversity_logo_2017.svg.png
47px-Wikidata-logo.svg.png
51px-Wikinews-logo.svg.png
120px-BlemannianabudWaite.jpg
136px-Jaya_Bachchan_still16_(square_crop).jpg
160px-Kyoto_Animation_Studio_1a.jpg
186px-Mary_rose_2019_sideish.JPG
192px-Armia_Czerwona,_Wehrmacht_22.09.1939_wspólna_parada_(wide_crop).jpg
310px-Margaret_Hamilton_-_restoration.jpg
Wikimedia_Community_Logo.svg.png
poweredby_mediawiki_88x31.png
wikimedia-button.png

But, adding ".js" to the JS files did seem to fix it. I guess this means that Chrome maybe has an exemption in its rendering for the file:/// protocol where it whitelists text files whose urls end with ".js". Perhaps we can appeal to Chrome to utilise fact when saving the files from accepted script tags, and add that same suffix when mangling the names.

Alternatively, I wonder if Content-Disposition would work here. That might be a cleaner way.

Btw, I'm also noticing that all images all broken from a saved page due to srcset not being rewritten by the Save Page feature in the browser, and thus the protocol-relative urls take on a different meaning when left alone.

Krinkle removed Krinkle as the assignee of this task.Jul 26 2019, 4:37 PM
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle added a subscriber: Krinkle.

I'd recommend reporting this upstream to Chrome and then referencing it in a comment on our tracking issue at https://bugs.chromium.org/p/chromium/issues/detail?id=463348.

Also, are Firefox and Safari any better?

SBisson added a subscriber: SBisson.

Cleaning up the New-Readers project. I don't think this task belongs there. Let me know if I'm missing anything.

Krinkle added a comment.EditedMay 4 2020, 5:03 PM

Chrome:

  • Chrome saves stylesheets by the URL basename, e.g. load (1).php and load (2).php.
  • When viewing Chrome's own copy, it refuses to processs these same files as stylesheets.

Firefox:

  • Firefox saves stylesheets with a suffix based on the effective content type, e.g. load.css.
  • When viewing Firefox's own copy, it processes these fine.
  • When viewing Chrome's copy in Firefox, it processes those files fine as well.

Chrome:

  • When viewing Firefox's copy in Chrome, it processes those files fine as well.

So clearly a bug in Chrome I guess :)

Krinkle added a project: Upstream.

Note that the images are broken as well. This task is mainly about the stylesheets being broken in Chrome, but I'll briefly mention why the images are broken.

The browser (both Firefox and Chrome), store all styles and images (regardless of domain they are from) in the saved directory. But, for images it is only saving the file referenced in <img src>. As with the stylesheets, the browser also rewrites the HTML so that these attributes point to the local file instead. The files referenced in <img srcset> are ignored and left as-is.

Given we use protocol-relative urls for our images, these thus don't work once processed in the context of an offline file. This too is an upstream browser bug, considering that in general they already have logic to rewrite protocol urls and correctly do this already for URLs it found in the CSS code, and in <img src>. It just isn't account for the ones in <img srcset> yet. On the other hand, this is also something we could relatively easily fix on our end. Ref T54253.

Tgr added a comment.Mon, May 11, 11:05 AM

There is also an older upstream report at #927948.

Btw, I'm also noticing that all images all broken from a saved page due to srcset not being rewritten by the Save Page feature in the browser, and thus the protocol-relative urls take on a different meaning when left alone.

That seems to be #962040.