Page MenuHomePhabricator

ResourceLoader breaks Save Page functionality in browsers
Open, NormalPublic

Description

If you open a Wikipedia page in Chrome, and choose Save as... / Web Page, Complete from the right-click menu, then try to open to saved page, it will be barely readable (almost all CSS will break). This is due to ResourceLoader serving all CSS with a load.php filename - Chrome will save it as such, and then presumably use the file extension to guess content-type, and thus fail to interpret it as CSS. Given our focus on offline usage, breaking the most intuitive way of saving a page for offline usage in the most popular browser seems like a bad thing, and it seems reasonably easy to fix - just use URLs with the appropriate extension, ie. load.php/foo.css and load.php/bar.js (with a feature flag to disable in case the server does not support PATH_INFO).

(Via StackOverflow.)

Event Timeline

Tgr created this task.Jul 6 2019, 10:03 PM
Restricted Application added a project: Performance-Team. · View Herald TranscriptJul 6 2019, 10:03 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
kchapman assigned this task to Krinkle.Jul 8 2019, 7:49 PM
kchapman moved this task from Inbox to Doing on the Performance-Team board.
Krinkle triaged this task as Normal priority.Jul 8 2019, 7:50 PM

As far as I know, this used to work and we haven't changed anything in this regard.

Browsers tend to create a directory when saving a page, in which each known subresource is given a unique and flat file name, with the HTML file rewritten to match those names. These new names are different because url schemes and query parameters tend to not be supported in file names.

The browser would e.g. turns load.php?startup and load.php?site into load.php_1 and load.php_2. I don't recall whether browsers also append a suffix (based on the content-type header), or whether they just discard that information and instead allow any file type to be used at run-time. The latter is how it used to be for <script>, and this is still the case for other resources. E.g. <link rel=stylesheet href=foo.txt> and <img src=foo.txt> will work fine if they contain stylesheets and image binary.

As far as I know, adding a suffix here shouldn't make a difference, unless there is a server in-between that would use that suffix to synthesise a Content-Type header. While Nginx and Apache would do that, afaik the file-server that browsers use doesn't (except for the navigation URL's file, to decide how to render it).

What has changed, is that Chrome (and other browsers) no longer allow execution of JavaScript code that was served as text/plain. Whereas they do still allow this for stylesheets, images, and video.

In trying this out locally, I can confirm this:

Wikipedia_files
load.php
load(1).php
load(2).php
load(3).php
31px-Commons-logo.svg.png
35px-Mediawiki-logo.png
35px-Wikibooks-logo.svg.png
35px-Wikiquote-logo.svg.png
35px-Wikisource-logo.svg.png
35px-Wikispecies-logo.svg.png
35px-Wikivoyage-Logo-v3-icon.svg.png
35px-Wiktionary-logo-v2.svg.png
41px-Wikiversity_logo_2017.svg.png
47px-Wikidata-logo.svg.png
51px-Wikinews-logo.svg.png
120px-BlemannianabudWaite.jpg
136px-Jaya_Bachchan_still16_(square_crop).jpg
160px-Kyoto_Animation_Studio_1a.jpg
186px-Mary_rose_2019_sideish.JPG
192px-Armia_Czerwona,_Wehrmacht_22.09.1939_wspólna_parada_(wide_crop).jpg
310px-Margaret_Hamilton_-_restoration.jpg
Wikimedia_Community_Logo.svg.png
poweredby_mediawiki_88x31.png
wikimedia-button.png

But, adding ".js" to the JS files did seem to fix it. I guess this means that Chrome maybe has an exemption in its rendering for the file:/// protocol where it whitelists text files whose urls end with ".js". Perhaps we can appeal to Chrome to utilise fact when saving the files from accepted script tags, and add that same suffix when mangling the names.

Alternatively, I wonder if Content-Disposition would work here. That might be a cleaner way.

Btw, I'm also noticing that all images all broken from a saved page due to srcset not being rewritten by the Save Page feature in the browser, and thus the protocol-relative urls take on a different meaning when left alone.

Krinkle removed Krinkle as the assignee of this task.Fri, Jul 26, 4:37 PM
Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.
Krinkle added a subscriber: Krinkle.