>>! In T146099#2657459, @greg wrote:
> File follow-up tasks and resolve this one?
Filed T146510 for investigating and mitigating it before we try async cache-eval again.
>>! In T146099#2657437, @Krinkle wrote:
> * Affected metrics
> * RUM Median story
> * RUM p95 story
> * WebPageTest story
>
> ## Affected metrics
>
> **NavigationTiming metrics around September 8th: **
>
> //(Links to Grafana viewing from 20160801 to 20160918)//
>
> | Metric | Difference | Graph
> |--|--|--
> | [responseStart](https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=responseStart&from=20160801&to=20160918) | //unaffected// |-
> | [domInteractive](https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=domInteractive&from=20160801&to=20160918) (From navigationStart to document received, parsed, and blocking resources ready) | Median: **-50ms** (from 870ms to 820ms)|-
> | | p95: **-6%** (4.5s to 4.2s) | -
> | [firstPaint](https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=firstPaint&from=20160801&to=20160918) | Median: //unaffected// (~1000ms) | {F4509780}
> | | p95: {icon arrow-circle-up color=red} **+441%** (from 4.9s to 21.6s) | {F4509778}
> | [domComplete](https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=domComplete&from=20160801&to=20160918) | Median: {icon arrow-circle-down color=green} **-20%** (from 1900ms to 1500ms) | {F4509802}
> | | p95: -8% (from 9.4s to 8.7s) | -
> | [loadEventEnd](https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=loadEventEnd&from=20160801&to=20160918) | Median: {icon arrow-circle-down color=green} **-25%** (from 2000ms to 1600ms) | {F4509830}
> | | p95: {icon arrow-circle-down color=green} **-10%** (from 10s to 9s) | -
> | [mediaWikiLoadComplete](https://grafana.wikimedia.org/dashboard/db/navigation-timing?var-metric=mediaWikiLoadComplete&from=20160801&to=20160918) (Time startup.js loaded execution and NavTiming module being loaded) | Median: {icon arrow-circle-up color=red} **+400ms** (from 900ms to 1300ms) | {F4509686}
> | | p95: {icon arrow-circle-up color=red} **+100%** (from 5s to 10s) | {F4509687}
>
> ## RUM Median story
>
> Before:
> | | 0s | .. | 870ms | 1000ms | .. | 1900ms | 2000ms
> | Navigation Timing | navigationStart | | domInteractive | firstPaint | | domComplete | loadEvent
> | MediaWiki || mediaWikiLoadStart || //..(+900ms)// | mediaWikiLoadEnd | | |
>
> After:
> | | 0s | .. | 820ms | 1000ms | .. | 1500ms | 1600ms
> | Navigation Timing | navigationStart | | domInteractive | firstPaint | | domComplete | loadEvent
> | MediaWiki || mediaWikiLoadStart || //..(+1300ms)// | mediaWikiLoadEnd | | |
>
> For the median use case it seems like events mostly re-ordered, in a way that ultimately actually makes the page finish loading quicker. Assuming that this is caused by "resourceloader: Make cache-eval asynchronous ", it happened by letting the browser prioritise getting the page ready, on-screen and interactive; in favour of spreading out module execution.
>
> ## RUM p95
>
> Before:
> | | 0s | .. | 4.5s | 4.9s | .. | 9.4s | 10s
> | Navigation Timing | navigationStart | | domInteractive | firstPaint | | domComplete | loadEvent
> | MediaWiki || mediaWikiLoadStart || //..(+5s)// | mediaWikiLoadEnd | | |
>
> After:
> | | 0s | .. | 4.2s | 8.7s | 9s | .. | 21.6s
> | Navigation Timing | navigationStart | | domInteractive | domComplete | loadEvent || firstPaint |
> | MediaWiki || mediaWikiLoadStart || //..(+10s)// || mediaWikiLoadEnd | ||
>
> Here too overall page load is faster. But JS execution is spread out even more. This could be due to various reasons. The most likely reasons is a combination of the following three reasons.
> * requestIdleCallback in chrome causing a delay upto 30s (fixed in master)
> * setTimeout being deprioritized in background tabs (probably acceptable and desirable; not sure if a significant amount of page views initialise in a background tab and/or pre-render).
> * Cache-eval goes in chunks of 50ms. On slower devices (e.g. mobile) will require a lot of chunks, compared to the previous approach of executing everything in one go. However the downside is that this will render the user's browser unresponsive for a long time - which is arguably a much worse user experience. To be confirmed.
>
> ## WebPageTest
>
> <https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown>
> (user: Anonymous, project: enwiki, page: Facebook)
>
> Desktop first view:
> * SpeedIndex: Unaffected.
> * Start render:
> * us-east1/Firefox: Consistently hitting the low-end more frequently (range unchanged).
> * us-east1/Chrome: Range is consistently 100ms lower.
> * us-east1/IE: Unchanged.
> * Fully loaded:
> * us-east1/Firefox: Unchanged.
> * us-east1/Chrome: Consistently 100ms lower.
> * us-east1/IE: Unchanged.
> * mwLoadStart/mwLoadEnd
> * Firefox: Unchanged.
> * Chrome: Consistently ends 100ms earlier.
> * IE: No clear change.
>
> Desktop repeat view (Chrome only):
> <https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?var-view=repeatView>
> * SpeedIndex: 1.2s -> 800ms.
> * Start render: 790ms -> 570ms.
> * Fully loaded: 2.6s -> 2.3s.
>
> Real Mobile (Motorola G, 3G-Fast):
> <https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?panelId=22&fullscreen>
> * SpeedIndex: About 200 lower on average
> * fullyLoaded: No clear change.
> * mwLoadStart/End: No clear change.
> * render: No clear change.
>
> Second page view (Chrome, emulateMobile):
> <https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown?panelId=26&fullscreen>
> * SpeedIndex: -100 (2014 -> 1930)
> * render: -100 (1990 -> 1891)
> * fullyLoaded: +300ms (2.1s -> 2.4s)
> * mwLoadStart: +/-0 (210ms)
> * mwLoadEnd: +200ms (2.4s -> 2.6s)
In addition, it appears ResourceLoader traffic was also affected in interesting ways.
<https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1472080149720&to=1474672149720>
{F4518304 height=300}
{F4518302 height=100}
Detail:
<https://grafana.wikimedia.org/dashboard/db/resourceloader?from=1473339406287&to=1473486088050>
{F4518312 height=350}
{F4518311 height=140}
On September 8 between 18:45 and 19:15 there was a very sudden increase in total traffic to load.php. Most of these extra requests had an If-None-Match header, and were responded to with HTTP 200 and a short cache-expiry.
Log entries confirm that this is indeed also caused by the async cache-eval change:
> 19:02 logmsgbot: demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.18