Most web performance tools and resources recommend prioritizing above-fold CSS delivery to unblock streaming rendering as early as possible. Linked CSS starts loading late (only after the HTML head has been loaded) and then contends for bandwidth with the HTML that is still loading in parallel. Inlined CSS avoids both problems, immediately unblocking the browser to progressively render HTML as it arrives.
Our own results from T113066#1893866 corroborate the large effect of CSS delivery on first paint, especially on slow connections:
| **Strategy / content type** | **First paint @"fast 2g", 840ms rtt** | **First paint @"slow 2g", 1300ms rtt** |
| [Obama with images & external styles](https://en.wikipedia.org/api/rest_v1/page/html/Barack_Obama) | [24.6s](http://www.webpagetest.org/result/151221_PC_A50/) | [92.5s](http://www.webpagetest.org/result/151221_4N_A4H/) |
| [Obama without images, with external styles](https://people.wikimedia.org/~gwicke/Barack_Obama.html) | [29s](http://www.webpagetest.org/result/151221_7R_9WR/) | [52s](http://www.webpagetest.org/result/151221_B1_9WH/) |
| [Obama without images and styles](https://people.wikimedia.org/~gwicke/Barack_Obama_nostyles.html) | [4.5s](http://www.webpagetest.org/result/151220_68_NPS/) | [8.2s](http://www.webpagetest.org/result/151221_JG_9KW/) |
| [Obama without images, inline styles](https://people.wikimedia.org/~gwicke/Barack_Obama_inline_styles.html) | [5s](http://www.webpagetest.org/result/151221_DM_7TQ/) | [11.3s](http://www.webpagetest.org/result/151221_XQ_9QH/) |
| **[Obama with images, inline styles](https://people.wikimedia.org/~gwicke/Barack_Obama_images_inline_styles.html)** | [5s](http://www.webpagetest.org/result/151221_W9_A8Y/); Note deferred below-fold image loading. | [11.4s](http://www.webpagetest.org/result/151221_Z5_A90/) |
| **[Obama lead section only, slim loot HTML](https://future-wikipedia.wmflabs.org/wiki/Barack_Obama)** | [6.2s](http://www.webpagetest.org/result/151221_S8_9XE/) | [10.6s](http://www.webpagetest.org/result/151221_Q4_A35/) |
| **[Obama full page, slim loot HTML](https://future-wikipedia.wmflabs.org/wiki/Barack_Obama?full=1)** | [6s](http://www.webpagetest.org/result/151221_FQ_AWK/) | [19.8s](http://www.webpagetest.org/result/151221_NJ_AWG/) |
A surprising (but tangential) result is that Chrome already seems to defer loading of below-fold images, at least if CSS is available to determine above / below fold status. Time to a rendered & interactive first screen is almost unaffected by image loading if CSS is inlined or generally loaded before above-fold images start loading.
On a Galaxy Note 3 using a wifi connection, Chrome renders the first screen of [Obama with images and inline styles](https://people.wikimedia.org/~gwicke/Barack_Obama_images_inline_styles.html) after about a second. The full page load takes about six seconds. CPU does not seem to be a bottleneck for first paint on this ~2 year old device. Scrolling is smooth all the way through the rendering phase.
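To make the "inline styles" variants in the table concrete, here is a minimal sketch of the transformation they imply: each `<link rel="stylesheet">` in the HTML is replaced by a `<style>` block holding the same CSS. The function name `inlineStylesheets` and the `cssByHref` map are hypothetical and not part of any existing code; the regex-based matching is an assumption that only holds for simple, well-formed markup.

```typescript
// Hypothetical helper (not an existing API): replaces <link rel="stylesheet">
// tags with inline <style> blocks so that first paint does not wait on a
// separate CSS request. `cssByHref` maps stylesheet URLs to their contents.
function inlineStylesheets(html: string, cssByHref: Map<string, string>): string {
  const linkTag = /<link\b[^>]*>/gi;
  return html.replace(linkTag, (tag: string): string => {
    // Leave non-stylesheet links (icons, alternates, ...) untouched.
    if (!/\brel\s*=\s*["']?stylesheet["']?/i.test(tag)) return tag;
    const url = /\bhref\s*=\s*["']([^"']+)["']/i.exec(tag)?.[1];
    if (url === undefined) return tag;
    const css = cssByHref.get(url);
    // Unknown stylesheet: keep it as a normal linked stylesheet.
    if (css === undefined) return tag;
    // Escape any literal "</style" so the CSS cannot close the block early.
    const safeCss = css.replace(/<\/style/gi, "<\\/style");
    return `<style>${safeCss}</style>`;
  });
}
```

A production version would more likely operate on a parsed DOM than on regular expressions; the sketch only illustrates the shape of the change.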
### Optimizations: Only inline above-fold CSS
Google's PageSpeed module supports a mode that automatically identifies above-fold CSS and inlines it: https://developers.google.com/speed/pagespeed/module/filter-prioritize-critical-css. There are also a variety of Node tools for similar purposes; see https://github.com/addyosmani/critical-path-css-tools and https://css-tricks.com/authoring-critical-fold-css/.
Performance is likely to be an issue with these dynamic solutions. It might make more sense to use them as a starting point for a static split of above-fold vs. below-fold CSS instead.
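A static split could look roughly like the sketch below. It assumes the set of above-fold selectors has already been determined offline (for example with one of the tools above) and that the CSS is available as a flat list of rules. The `CssRule` shape and the `splitCss` name are hypothetical, introduced only for illustration.

```typescript
// Hypothetical rule representation: one selector plus its declaration body.
interface CssRule {
  selector: string;
  body: string;
}

// Splits rules into a critical (above-fold) part to inline and a deferred
// part to load later. Source order is preserved within each part.
function splitCss(
  rules: CssRule[],
  aboveFoldSelectors: Set<string>
): { critical: string; deferred: string } {
  const critical: string[] = [];
  const deferred: string[] = [];
  for (const rule of rules) {
    const text = `${rule.selector}{${rule.body}}`;
    if (aboveFoldSelectors.has(rule.selector)) {
      critical.push(text);
    } else {
      deferred.push(text);
    }
  }
  return { critical: critical.join(""), deferred: deferred.join("") };
}
```

One caveat with any such split: moving rules into two separate stylesheets changes their relative order, so rules of equal specificity that target the same element may cascade differently than before.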
## Alternative approaches to deliver CSS early
### HTTP/2 push
There are some early cache-aware HTTP/2 push implementations like http://blog.kazuhooku.com/2015/12/optimizing-performance-of-multi-tiered.html, but implementing this in our current infrastructure does not seem to be very straightforward.
### ServiceWorker CSS caching / injection
Repeat requests can be sped up by persistently caching & quickly delivering CSS from a ServiceWorker. However, this won't address the large percentage of occasional visits or clients without ServiceWorker support, so it can only be seen as a complementary optimization to inlining or HTTP/2 push.
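The caching part amounts to a cache-first lookup, sketched below with the cache and the network passed in as parameters so the logic can be read (and run) outside a browser. The `CssCache` and `Fetcher` types and the `cssCacheFirst` name are hypothetical stand-ins, not the actual ServiceWorker API.

```typescript
// Hypothetical minimal stand-ins for the browser's cache and network layers.
interface CssCache {
  match(url: string): Promise<string | undefined>;
  put(url: string, body: string): Promise<void>;
}
type Fetcher = (url: string) => Promise<string>;

// Cache-first strategy: serve CSS from the persistent cache when present,
// otherwise fetch it once from the network and store it for later visits.
async function cssCacheFirst(
  url: string,
  cache: CssCache,
  fetchCss: Fetcher
): Promise<string> {
  const cached = await cache.match(url);
  if (cached !== undefined) return cached;
  const body = await fetchCss(url);
  await cache.put(url, body);
  return body;
}
```

In a real ServiceWorker the same flow would sit inside a `fetch` event handler and use `event.respondWith()` together with the Cache API (`caches.open()`, `cache.match()`, `cache.put()`) and `fetch()`. The first visit still pays the full network cost, which is why this only complements inlining.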
Given the fairly low complexity of a minimal implementation & the very significant performance gains, I think we should look closely at applying CSS inlining across the board.
For a production deploy, we probably want to investigate how much size we can save with a static above-fold / below-fold ResourceLoader (RL) module split. While even simple inlining is a big gain on 2g, the roughly 16 KB of extra compressed response size from fully inlined RL styles might slightly reduce performance on repeat requests, where the RL response would normally already be cached in the client.