Inline above-fold CSS in HTML response for MediaWiki to reduce time to first paint
Closed, Declined (Public)

Description

Most web performance tools and resources recommend prioritizing above-fold CSS delivery to unblock streaming rendering as early as possible. Linked CSS suffers from a late start (only after the HTML head is loaded), and contention with the parallel HTML load. Inlined CSS avoids this, immediately unblocking the browser to progressively render HTML as it arrives.

Our own results from T113066#1893866 corroborate the huge influence on first paint, especially on slow connections:

A surprising (but tangential) result is that Chrome already seems to defer loading of below-fold images, at least if CSS is available to determine above / below fold status. Time to a rendered & interactive first screen is almost unaffected by image loading if CSS is inlined or generally loaded before above-fold images start loading.

On a Galaxy Note 3 using a wifi connection, Chrome renders the first screen of Obama with images and inline styles after about a second. The full page load takes about six seconds. CPU does not seem to be a bottleneck for first paint on this ~2 year old device. Scrolling is smooth all the way through the rendering phase.

Optimizations: Only inline above-fold CSS

There is a good variety of tools available that automatically separate above-fold from below-fold styles. One of them is Google's PageSpeed module.

A likely issue with these dynamic solutions is going to be performance. It might make sense to use them as a starting point for a static split of above-fold vs. below-fold CSS instead.

Alternative approaches for early CSS delivery

HTTP/2 push

There are some early cache-aware HTTP/2 push implementations, but implementing this in our current infrastructure does not seem to be very straightforward. Nginx does not support push directly yet, which means that we'd need to use another HTTP2 frontend like nghttp2. It seems likely that nginx will gain support for HTTP2 push in the medium term as well.

ServiceWorker CSS caching / injection

Repeat requests can be sped up by persistently caching & quickly delivering CSS from a ServiceWorker. However, this won't address the large percentage of occasional visits or clients without ServiceWorker support, so can only be seen as a complementary optimization to inlining or HTTP push.

Proposal

Given the fairly low complexity of a minimal implementation & the very significant performance gains, I think we should look closely at applying inlined above-fold styles across the board.
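As a rough illustration of what a minimal implementation could look like, the sketch below swaps `<link rel="stylesheet">` tags for `<style>` blocks when the stylesheet text is already known. It is not MediaWiki or ResourceLoader code: the function name `inline_stylesheets`, the `css_by_href` mapping, and the regex-based matching are all assumptions made for this example, and a real implementation would work on the skin's output rather than on serialized HTML.

```python
import re
from typing import Mapping

# Hypothetical helper for illustration only; not part of MediaWiki.
LINK_RE = re.compile(r'<link\b[^>]*\brel=["\']stylesheet["\'][^>]*>', re.IGNORECASE)
HREF_RE = re.compile(r'\bhref=["\']([^"\']+)["\']', re.IGNORECASE)


def inline_stylesheets(html: str, css_by_href: Mapping[str, str]) -> str:
    """Replace stylesheet <link> tags with <style> blocks where the CSS is known.

    Keys of css_by_href must match the raw href attribute text (including any
    HTML entities). The sketch assumes the CSS never contains "</style".
    """

    def replace(match: "re.Match[str]") -> str:
        tag = match.group(0)
        href = HREF_RE.search(tag)
        if href is None or href.group(1) not in css_by_href:
            return tag  # unknown stylesheet: keep it as a normal link
        return f"<style>{css_by_href[href.group(1)]}</style>"

    return LINK_RE.sub(replace, html)


if __name__ == "__main__":
    page = '<head><link rel="stylesheet" href="/a.css"></head>'
    print(inline_stylesheets(page, {"/a.css": "body{margin:0}"}))
```

Stylesheets that are not in the mapping stay linked, which mirrors the idea of inlining only part of the styles and leaving the rest to a regular request.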

For a production deploy, we should investigate how much size we can save with a static above-fold / below-fold RL module split. While even simple inlining is a big gain on 2G, there is a chance that the currently ~16 KB extra compressed response size of fully inlined RL styles would slightly reduce performance on repeat requests, where the RL response would normally be cached in the client.

Possible issue: Cache invalidation for CSS

A simple implementation without ESI or similar would couple the cache life time of HTML and CSS. In many cases this is useful, as there is naturally a strong coupling between the two, but there might be dynamic features or general changes that we would prefer to apply more consistently and quickly.

For example, a font change can currently be applied fairly quickly in a way that applies to both old & new cached HTML. In a simple implementation without ESI, the font change would instead take up to a month to apply to all pages, as the lifetime of the CSS is coupled to the HTML.

Event Timeline

GWicke raised the priority of this task from to Medium.
GWicke updated the task description.
GWicke added subscribers: GWicke, ori, Jdlrobson and 3 others.

Chatted to @ori and @GWicke about this today.

The main issues with this seem to be around caching (that aside, doing this would be a no-brainer):

Editor expectations that edits to MediaWiki:Common.css are instantly visible

In the current setup with no edge caching for authenticated users, editors would actually see current styles by default. Anons would not, but many tweaks in common.css tend to be specific to features only authenticated users see / use, so this might not be prohibitive. Common.css vandalism could be an issue, but this problem does not seem to be very different from template vandalism. Both common.css and popular templates are protected, and both take a long time to update.

The biggest gains from inline styles are on clients with poor network connectivity. At the same time, I think expectations around instant common.css updates tend to be lower for the mobile site. It might be a good strategy to pioneer inline styles for mobile first, and then evaluate if the delayed style updates are an issue in practice before considering expanding to the desktop site.

I heard a rumour that @BBlack is looking at the typical maximum shelf life of a page in cache (currently assumed to be 30 days) to see if it is any lower. Am I correct? This will be interesting to know. I'm personally fine with this, given Minerva's design is pretty stable now, but I'd suggest we discuss how this impacts any noticeable UI changes we might make in future.

I'm keen for us to explore this near the end of the quarter. CSS is 7.36kb on mobile and currently excludes site CSS (MediaWiki:Mobile.css is loaded via JS at the moment) which means only CSS inside the Minerva skin would be cached up to the limit. I'm happy for the Minerva skin to be a guinea pig.

@Jdlrobson - The typical shelf life of a page is already lower than 30 days. We cap the maximum life at exactly 30 days in our caches, and we're looking at dropping that cap downwards ( T124954 ). Note also that as of a few hours ago, we've temporarily disabled SPDY to test perf impact on both slow and fast SPDY-capable clients, which might impact your measurements (or be useful to re-test against): T125979

Change 270434 had a related patch set uploaded (by Jdlrobson):
PoC: Inline top loaded CSS when MinervaInlineCSS is true

https://gerrit.wikimedia.org/r/270434

Change 270434 abandoned by Jdlrobson:
PoC: Inline top loaded CSS when MinervaInlineCSS is true

https://gerrit.wikimedia.org/r/270434

One option for resolving the conflict between long-term caching and the need to update inlined CSS in a timely manner is to use ServiceWorkers to compose the page. This composition is fairly cheap (hundreds to thousands of dynamic compositions per second & node when run server-side via node-serviceworker-proxy), which makes it feasible to drop Varnish TTLs significantly further without a significant increase in cost.

I prototyped this in https://github.com/gwicke/streaming-serviceworker-playground, and benchmarked this both on the client side & server side via https://swproxy.wmflabs.org/wiki/Foobar and https://swproxy-mobile.wmflabs.org/wiki/Foobar.

Results:

  • Client side with a ServiceWorker installed, inlining CSS does not make a major difference. The full CSS is cached in the ServiceWorker anyway, and will be fully fetched and parsed significantly before the actual content starts streaming in. In the case of a content cache hit, inline CSS can still improve performance slightly, but the differences are in the single-digit ms, and very close to the noise level.
  • Requests without a client-side ServiceWorker see a significant (~25-30%) improvement in first paint times. This is true even when comparing a labs proxy against fully cached Varnish responses. Streaming responses in the ServiceWorker help to keep the time to first byte down & at a similar level as the Varnish cache response.
Krinkle renamed this task from "[RFC] Performance: Inline above-fold CSS in HTML response to unblock render and reduce time to first paint" to "Inline above-fold CSS in HTML response for MediaWiki to reduce time to first paint". (Oct 3 2019, 5:07 PM)
Krinkle changed the task status from Open to Stalled.
Krinkle removed a project: Performance Issue.

Definitely still interested and aligned with our long-term direction. Currently blocked for similar reasons as T96797.

The Performance Issue tag is for tracking active performance issues. This task, however, is an epic goal for a general improvement. It does not describe a particular problem or regression. It also isn't going to happen until and unless org attention and priority shifts for this to be resourced as part of larger efforts relating to logged-in and geographic performance and skin rendering, which I'm hoping to get in the plan for next year.

Krinkle changed the task status from Stalled to Open. (Sep 26 2021, 10:13 PM)
Krinkle claimed this task.

I believe this is largely obsoleted by HTTP/2 and the improved concurrency and prioritization between different streams.

I'll also note that we don't currently have a means of distinguishing "above the fold" styles from other styles. This means the proposal either depends on additional complexity that it leaves out, or would mean we push the entire ~15KB of styles (after compression) repeatedly and uncached on every single page view. Not only would that be uncaring of the end-user's data cost, it would also make page loads slower when measured beyond just the first cold-cache view. As such, I don't expect this benefit to hold up in practice, as a majority of page loads don't have a cold cache (given a mean average session depth > 3 articles).

I'm self-assigning this to create an updated speed-test scenario to confirm this understanding.

Exciting!
From a skin development perspective, it would be useful to define a single ResourceLoader module name that would make sense to inline inside the skin options defined via skin registration in skin.json. I imagine we'd want to identify above-the-fold styles on a skin level.

Change 724206 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] speed-tests: Add "Oceanic.enwiki.1046871765" snapshot

https://gerrit.wikimedia.org/r/724206

Change 724206 merged by jenkins-bot:

[operations/mediawiki-config@master] speed-tests: Add "Oceanic.enwiki.1046871765" snapshot

https://gerrit.wikimedia.org/r/724206

Scenarios:

  1. Status quo, cold view: https://en.wikipedia.org/speed-tests/Oceanic.enwiki.1046871765/index.html.
  2. Embedded styles, cold view: https://en.wikipedia.org/speed-tests/Oceanic.enwiki.1046871765/inline_styles.html.
  3. Status quo, warm cache: Visit a different article (warmup), and then Oceanic/index.
  4. Embedded styles, warm cache: Visit a different article (warmup), and then Oceanic/inline_styles.

Metrics (as per guidelines)

  • Visual: Start render (WebPageTest video, 10 fps at 100ms precision), FCP ("First contentful paint", RUM Paint Timing), and LCP ("Largest contentful paint", Web Vitals).
  • Page load time: domComplete (after HTML, CSS, JS, and images have finished loading).
  • Size: Total "Bytes in" (actual transfer size, prior to on-device decompression).

Configuration:

  • WebPageTest, Paris, Chrome.
  • Network: 3G Slow, 3G Fast.
  • Median of 7 runs, by loadTime.
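
"Median of 7 runs, by loadTime" can be read as: sort the runs by their load time and report every metric from the single middle run, rather than taking a separate median per metric. A minimal sketch of that selection follows. The `loadTime` key comes from the configuration above; the other field name and all sample values are made up for illustration and are not results from this task.

```python
from typing import Dict, List


def median_run(runs: List[Dict[str, float]], key: str = "loadTime") -> Dict[str, float]:
    """Return the run whose `key` value is the median of all runs.

    For an odd number of runs (such as 7) this is the exact middle run.
    For an even number this sketch picks the upper-middle run, which is an
    arbitrary choice made here, not a documented WebPageTest rule.
    """
    if not runs:
        raise ValueError("at least one run is required")
    ordered = sorted(runs, key=lambda run: run[key])
    return ordered[len(ordered) // 2]


if __name__ == "__main__":
    # Invented sample values in milliseconds.
    sample = [
        {"loadTime": 5000, "firstContentfulPaint": 2400},
        {"loadTime": 3000, "firstContentfulPaint": 2300},
        {"loadTime": 9000, "firstContentfulPaint": 2600},
        {"loadTime": 1000, "firstContentfulPaint": 2100},
        {"loadTime": 7000, "firstContentfulPaint": 2500},
        {"loadTime": 2000, "firstContentfulPaint": 2200},
        {"loadTime": 8000, "firstContentfulPaint": 2700},
    ]
    print(median_run(sample))
```

Reporting one real run keeps the paint, load-time and size figures in a row consistent with each other.
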

| 3G Slow | FCP (Start render) | Page load time | Page weight |
| --- | --- | --- | --- |
| 1. Status quo, cold first | 3.873s (3.900s) | 5.372s | 131 KB |
| 2. Embedded styles, cold first | 2.337s (2.300s) | 5.229s | 131 KB |
| 3. Status quo, warm session | 0.655s (0.600s) | 2.560s | 94 KB |
| 4. Embedded styles, warm session | 0.994s (1.000s) | 2.901s | 106 KB |

| 3G Fast | FCP (Start render) | Page load time | Page weight |
| --- | --- | --- | --- |
| 1. Status quo, cold first | 1.537s (1.500s) | 1.856s | 131 KB |
| 2. Embedded styles, cold first | 1.100s (1.000s) | 1.678s | 131 KB |
| 3. Status quo, warm session | 0.298s (0.300s) | 0.791s | 94 KB |
| 4. Embedded styles, warm session | 0.359s (0.400s) | 0.867s | 106 KB |

More or less as expected. Declining as such, also because we have plenty of other things to work on that have fewer or no drawbacks, especially in terms of maintenance and ongoing developer productivity.

Noting that again this is with a number of things tilted in the experiment's favour, such as:

  • The artificial network delay is applied evenly to all requests, thus holding back the entire response until a predefined point, including our preload to fetch the stylesheet. It also seems WPT is doing it evenly. E.g. three requests that are sent at the same time come back at +600ms, +900ms, and +200ms instead of all after +600ms. I'm not quite sure why, but this detail is essentially why the render times differ. In practice these would generally arrive at the same time.
  • Assumes a theoretical ideal where we'd perfectly know the critical styles and could remove the stylesheet roundtrip entirely. So long as even 1 byte has to come from the stylesheet, the benefits go away completely since it isn't the download size we're reducing but the ordering of network roundtrips in relation to permitting the browser to render. See also unresolved T190083.
  • Would require significant complexity for developers, on-going maintenance, and more complicated on-boarding to develop frontend code at WMF.
  • Degrades load time, paint time, and page weight for all but the first cold page view.
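
To look at views beyond the first cold one, the measured cold and warm figures can be added up over a session of a chosen depth. The sketch below does this with the numbers from the result tables (FCP, page load time, page weight). The names `View`, `RESULTS` and `session_totals` are made up for this example, the session depth is an input rather than a measured value, and plain summation is a crude aggregate chosen here for illustration, not the method used in this task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    fcp_s: float      # first contentful paint, seconds
    load_s: float     # page load time, seconds
    weight_kb: float  # page weight ("Bytes in"), KB


# (cold first view, warm session view), copied from the result tables.
RESULTS = {
    ("3G Slow", "status quo"): (View(3.873, 5.372, 131), View(0.655, 2.560, 94)),
    ("3G Slow", "embedded"): (View(2.337, 5.229, 131), View(0.994, 2.901, 106)),
    ("3G Fast", "status quo"): (View(1.537, 1.856, 131), View(0.298, 0.791, 94)),
    ("3G Fast", "embedded"): (View(1.100, 1.678, 131), View(0.359, 0.867, 106)),
}


def session_totals(cold: View, warm: View, depth: int) -> View:
    """Sum one cold view plus (depth - 1) warm views."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    warm_views = depth - 1
    return View(
        fcp_s=cold.fcp_s + warm_views * warm.fcp_s,
        load_s=cold.load_s + warm_views * warm.load_s,
        weight_kb=cold.weight_kb + warm_views * warm.weight_kb,
    )


if __name__ == "__main__":
    depth = 3  # assumed session depth for this example
    for (network, variant), (cold, warm) in RESULTS.items():
        print(network, variant, session_totals(cold, warm, depth))
```

For example, at a depth of 3 the summed page weight is 319 KB for the status quo and 343 KB for embedded styles on either network profile. The paint and load-time sums come out of the same call.
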

In the future, I expect general improvements to make this less and less relevant. For example, improvements for the web at large (in browsers, and HTTP2/HTTP3 protocols), but also in our own architecture:

  • refactoring Skin to be a light transformation atop ParserCache, T140664,
  • improving logged-in experience,
  • potential future involving ServiceWorker proxy that renders a Skin template,
  • improved HTTP preloading,
  • improved cache fragmentation and bundling for stylesheets.

Any number of these would make embedding styles a worse net-negative than today, or in some cases break even, but it's not a foundation that will reward us further.