<feed xmlns="http://www.w3.org/2005/Atom"><title>The Speed of Thought</title><id>https://phabricator.wikimedia.org/phame/blog/feed/7/</id><link rel="self" type="application/atom+xml" href="https://phabricator.wikimedia.org/phame/blog/feed/7/" /><updated>2023-03-01T22:00:44+00:00</updated><subtitle>**We have moved!** ➡️  Find our latest posts at [techblog.wikimedia.org](https://techblog.wikimedia.org/category/performance/)!

**“**//We want to create value for readers and editors by making it possible to retrieve and render content at the speed of thought, from anywhere in the world, on the broadest range of devices and connection profiles.//**”**  — [Wikimedia Performance](https://www.mediawiki.org/wiki/Wikimedia_Performance_Team)</subtitle><entry><title>WikimediaDebug v2 is here!</title><link href="/phame/live/7/post/183/wikimediadebug_v2_is_here/" /><id>https://phabricator.wikimedia.org/phame/post/view/183/</id><author><name>Krinkle (Timo Tijhof)</name></author><published>2019-12-16T18:48:38+00:00</published><updated>2023-03-01T22:00:44+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p><strong>WikimediaDebug</strong> is a set of tools for debugging and profiling MediaWiki web requests in a production environment. WikimediaDebug can be used through the accompanying browser extension, or from the command-line.</p>

<p>This post highlights changes we made to WikimediaDebug over the past year, and explains more generally how its capabilities work.</p>

<ol class="remarkup-list">
<li class="remarkup-list-item">What&#039;s new?</li>
<li class="remarkup-list-item">Features overview: Staging changes, Debug logging, and Performance profiling.</li>
<li class="remarkup-list-item">How does it all work?</li>
</ol>

<h4 class="remarkup-header">§ 1. What&#039;s new?</h4>

<h6 class="remarkup-header">Redesigned</h6>

<p>I&#039;ve redesigned the popup using the style and components of the <a href="https://design.wikimedia.org/style-guide/visual-style.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">Wikimedia Design Style Guide</a>.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>New design</th><th>Previous design</th></tr>
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/kiz4vbmygjwbmausdmzx/PHID-FILE-kgnuwoa75l5zowf7rvnr/Screenshot_2019-12-14_at_01.49.04.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_0"><img src="https://phab.wmfusercontent.org/file/data/kiz4vbmygjwbmausdmzx/PHID-FILE-kgnuwoa75l5zowf7rvnr/Screenshot_2019-12-14_at_01.49.04.png" height="157" alt="Screenshot 2019-12-14 at 01.49.04.png (312×640 px, 17 KB)" /></a></div></td><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/zbxa3fsa24n3wvhmqgmf/PHID-FILE-bfxboevnqisgy3hlzpvo/X-Wikimedia-Debug_v1.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_1"><img src="https://phab.wmfusercontent.org/file/data/zbxa3fsa24n3wvhmqgmf/PHID-FILE-bfxboevnqisgy3hlzpvo/X-Wikimedia-Debug_v1.png" height="91" alt="X-Wikimedia-Debug_v1.png (91×291 px, 10 KB)" /></a></div></td></tr>
<tr></tr>
</table></div>

<p>The images above also show improved labels for the various options. For example, &quot;Log&quot; is now known as &quot;Verbose log&quot;. The footer links also have clearer labels now, and visually stand out more.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>New footer</th><th>Previous footer</th></tr>
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/746425hr42skaomeolkz/PHID-FILE-wbecmicgwuyohobewpsn/b.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_2"><img src="https://phab.wmfusercontent.org/file/data/746425hr42skaomeolkz/PHID-FILE-wbecmicgwuyohobewpsn/b.png" height="120" alt="b.png (240×633 px, 8 KB)" /></a></div></td><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/wf7uj5gs2s7smbpxi4ei/PHID-FILE-7hlw5ftmnimb4v2s5v5c/a.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_3"><img src="https://phab.wmfusercontent.org/file/data/wf7uj5gs2s7smbpxi4ei/PHID-FILE-7hlw5ftmnimb4v2s5v5c/a.png" height="120" alt="a.png (120×311 px, 14 KB)" /></a></div></td></tr>
<tr></tr>
</table></div>

<p>This release also brings dark mode support: a brighter icon, a slightly muted color palette, and darker tones overall. The color scheme switches automatically based on your device settings.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>Dark mode</th></tr>
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/tev4molpr3wqf3ive7fu/PHID-FILE-bwh3hl34yia4oiubj56r/Screenshot_2019-12-13_at_23.05.57.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_4"><img src="https://phab.wmfusercontent.org/file/data/tev4molpr3wqf3ive7fu/PHID-FILE-bwh3hl34yia4oiubj56r/Screenshot_2019-12-13_at_23.05.57.png" height="157" alt="Screenshot 2019-12-13 at 23.05.57.png (312×640 px, 16 KB)" /></a></div></td></tr>
<tr></tr>
</table></div>



<h6 class="remarkup-header">Inline profile</h6>

<p>I&#039;ve added a new &quot;Inline profile&quot; option. This is a quicker and more lightweight alternative to the &quot;XHGui&quot; profile option. It outputs the captured performance profile directly to your browser (as a hidden comment at the end of the HTML or CSS/JS response).</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/trv25grxymwflyrbmknl/PHID-FILE-w6ipip3kc7kk2y6sx6vu/WikimediaDebug_v2_eg-rlstartup-inlineprof.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_5"><img src="https://phab.wmfusercontent.org/file/data/trv25grxymwflyrbmknl/PHID-FILE-w6ipip3kc7kk2y6sx6vu/WikimediaDebug_v2_eg-rlstartup-inlineprof.png" height="212" alt="WikimediaDebug v2 eg-rlstartup-inlineprof.png (850×1 px, 240 KB)" /></a></div></p>

<h6 class="remarkup-header">Beta Cluster support</h6>

<p>This week, I&#039;ve set up an XHGui server in the Beta Cluster. With this release, WikimediaDebug has reached feature parity between Beta Cluster and production.</p>

<p>The extension recognises whether the current tab is for the Beta Cluster or for production, and adapts accordingly:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">The list of hostnames is omitted to avoid confusion (as there is no debug proxy in Beta).</li>
<li class="remarkup-list-item">The &quot;Find in Logstash&quot; link points to logstash-beta.wmflabs.org.</li>
<li class="remarkup-list-item">The &quot;Find in XHGui&quot; link points to performance-beta.wmflabs.org/xhgui/.</li>
</ul>
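<p>As a rough illustration, detecting which environment a tab belongs to only needs a hostname check on the tab&#039;s URL. The sketch below is not the extension&#039;s actual code; the hostname suffix and footer URLs are taken from the domains mentioned in this post:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Illustrative sketch only (not the extension's actual code): decide
// whether a tab belongs to the Beta Cluster by its hostname.
function isBetaCluster( tabUrl ) {
  var hostname = new URL( tabUrl ).hostname;
  return hostname.endsWith( '.beta.wmflabs.org' );
}

// The footer links can then adapt accordingly, for example:
function getLogstashUrl( tabUrl ) {
  return isBetaCluster( tabUrl ) ?
    'https://logstash-beta.wmflabs.org/' :
    'https://logstash.wikimedia.org/';
}</pre></div>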

<h4 class="remarkup-header">§ 2. Features overview</h4>

<h6 class="remarkup-header">Staging changes</h6>

<p>The most common use of WikimediaDebug is to verify software changes during deployments (e.g. <a href="https://wikitech.wikimedia.org/wiki/SWAT_deploys" class="remarkup-link remarkup-link-ext" rel="noreferrer">SWAT</a>). When deploying changes, the <a href="https://wikitech.wikimedia.org/wiki/Scap" class="remarkup-link remarkup-link-ext" rel="noreferrer">Scap</a> deployment tool first syncs to an mw-debug host. The user then toggles on WikimediaDebug and selects the staging host.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/p6hiciick7e7eoc5xmmp/PHID-FILE-i66soupmzco2xwsvvci4/WikimediaDebug_v2_backend.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_6"><img src="https://phab.wmfusercontent.org/file/data/p6hiciick7e7eoc5xmmp/PHID-FILE-i66soupmzco2xwsvvci4/WikimediaDebug_v2_backend.png" height="134" alt="WikimediaDebug v2 backend.png (269×610 px, 73 KB)" /></a></div></p>

<p>WikimediaDebug is now active and routes browser activity for WMF wikis to the staging host. This bypasses the CDN caching layers and load balancers normally involved with such requests.</p>

<h6 class="remarkup-header">Debug logging</h6>

<p>The MediaWiki software is instrumented with log messages throughout its source code. These indicate how the software behaves, which internal values it observes, and the decisions it makes along the way. In production we dispatch messages that carry the &quot;error&quot; severity to a central store for monitoring purposes.</p>

<p>When investigating a bug report, developers may try to reproduce the bug in their local environment with a verbose log. With WikimediaDebug, this can be done straight in production.</p>

<p>The &quot;Verbose log&quot; option configures MediaWiki to dispatch <em>all</em> its log messages, from any channel or severity level. Below is an example where the Watchlist component is used with the verbose log enabled.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/k3jxw7b5dfjwolx74amw/PHID-FILE-qtazl6okd74km5vefwjn/WikimediaDebug_v2_eg-watchlist-log.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_7"><img src="https://phab.wmfusercontent.org/file/data/k3jxw7b5dfjwolx74amw/PHID-FILE-qtazl6okd74km5vefwjn/WikimediaDebug_v2_eg-watchlist-log.png" height="300" alt="WikimediaDebug v2 eg-watchlist-log.png (900×1 px, 88 KB)" /></a></div></p>

<p>One can then reproduce the bug (on the live site). The verbose log is automatically sent to Logstash, for access via the Kibana viewer at <a href="https://logstash.wikimedia.org/app/kibana#/dashboard/mwdebug1002" class="remarkup-link remarkup-link-ext" rel="noreferrer">logstash.wikimedia.org</a> (restricted link).</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>Aggregate graphs (Kibana)</th><th>Verbose log (Kibana)</th></tr>
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/sstwshgdp25qucapfoba/PHID-FILE-apm5gvl43cbucld5o6zh/WikimediaDebug_v2_eg-watchlist-kibana1.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_8"><img src="https://phab.wmfusercontent.org/file/data/sstwshgdp25qucapfoba/PHID-FILE-apm5gvl43cbucld5o6zh/WikimediaDebug_v2_eg-watchlist-kibana1.png" height="200" alt="WikimediaDebug v2 eg-watchlist-kibana1.png (900×1 px, 59 KB)" /></a></div></td><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/4ouxou7ikzzshaxwibi6/PHID-FILE-b5vdi7bjseqlxwwr3nas/WikimediaDebug_v2_eg-watchlist-kibana2.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_9"><img src="https://phab.wmfusercontent.org/file/data/56bxvbwljliducxe3jci/PHID-FILE-f24fwetykarkwysz7hag/preview-WikimediaDebug_v2_eg-watchlist-kibana2.png" width="220" height="154.10837438424" alt="WikimediaDebug v2 eg-watchlist-kibana2.png (1×2 px, 252 KB)" /></a></div></td></tr>
<tr></tr>
</table></div>



<h6 class="remarkup-header">Performance profiling</h6>

<p>The performance profiler shows where time is spent in a web request. This feature was originally implemented using the XHProf PHP extension (for PHP 5 and HHVM). XHProf is no longer actively developed, or packaged, for PHP 7. As part of the PHP 7 migration this year, we migrated to Tideways which provides similar functionality. (<a href="https://phabricator.wikimedia.org/T176370" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_12"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T176370</span></span></a>, <a href="https://phabricator.wikimedia.org/T206152" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_13"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T206152</span></span></a>)</p>

<p>The Tideways profiler hooks into the internals of the PHP engine and tracks the duration of every subroutine call in the MediaWiki codebase, as well as its relation to other subroutines. This structure is known as a <strong>call tree</strong>, or <strong>call graph</strong>.</p>

<p>The performance profile we capture with Tideways is automatically sent to our XHGui installation at <a href="https://performance.wikimedia.org/xhgui/run/view?id=5df3ac043f3dfa6e273f3ba3" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://performance.wikimedia.org</a> (public). There, the request can be inspected in fine detail. In addition to a full call graph, it also tracks memory usage throughout the web request.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>Most expensive functions (XHGui)</th><th>Call graph (XHGui)</th></tr>
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/ykg6aoxmeriifwvaarut/PHID-FILE-4n4emqq47bnszs2z3oyy/WikimediaDebug_v2_eg-opensearch-xhgui.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_10"><img src="https://phab.wmfusercontent.org/file/data/yzmnqt5xr72ao3gr4cav/PHID-FILE-xyvfi35opt6aehhmuzap/preview-WikimediaDebug_v2_eg-opensearch-xhgui.png" width="220" height="160.0731261426" alt="WikimediaDebug v2 eg-opensearch-xhgui.png (1×2 px, 253 KB)" /></a></div></td><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/gw24kstlj64a2rrb5ta7/PHID-FILE-ohyn6a2tfeydfz2ioda2/WikimediaDebug_v2_eg-opensearch-xhgui2.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_11"><img src="https://phab.wmfusercontent.org/file/data/gw24kstlj64a2rrb5ta7/PHID-FILE-ohyn6a2tfeydfz2ioda2/WikimediaDebug_v2_eg-opensearch-xhgui2.png" height="160" alt="WikimediaDebug v2 eg-opensearch-xhgui2.png (901×1 px, 103 KB)" /></a></div></td></tr>
<tr></tr>
</table></div>



<h4 class="remarkup-header">§ 3. How does it all work?</h4>

<h6 class="remarkup-header">Browser extension</h6>

<p>The browser extension is written using the WebExtensions API, which both Firefox and Chrome implement.</p>

<p><span class="remarkup-nav-sequence"><a href="https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/" class="phui-tag-view phui-tag-type-shade phui-tag-grey phui-tag-shade phui-tag-icon-view " target="_blank" rel="noreferrer"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-plus" data-meta="0_0" aria-hidden="true"></span>Add to Firefox</span></a></span>   <span class="remarkup-nav-sequence"><a href="https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb" class="phui-tag-view phui-tag-type-shade phui-tag-grey phui-tag-shade phui-tag-icon-view " target="_blank" rel="noreferrer"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-plus" data-meta="0_1" aria-hidden="true"></span>Add to Chrome</span></a></span></p>

<p>You can find the <a href="https://github.com/wikimedia/WikimediaDebug" class="remarkup-link remarkup-link-ext" rel="noreferrer">source code on github.com/wikimedia/WikimediaDebug</a>. To learn more about how WebExtensions work,  refer to <a href="https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions" class="remarkup-link remarkup-link-ext" rel="noreferrer">MDN docs</a>, or <a href="https://developer.chrome.com/extensions" class="remarkup-link remarkup-link-ext" rel="noreferrer">Chrome docs</a>.</p>

<h6 class="remarkup-header">HTTP header</h6>

<p>When you activate WikimediaDebug, the browser is given an extra HTTP header. This header is sent along with all web requests relating to WMF&#039;s wiki domains, both those for production and those belonging to the Beta Cluster. In other words, any web request for <tt class="remarkup-monospaced">*.wikipedia.org</tt>, <tt class="remarkup-monospaced">wikidata.org</tt>, <tt class="remarkup-monospaced">*.beta.wmflabs.org</tt>, etc.</p>

<p>The header is called <tt class="remarkup-monospaced">X-Wikimedia-Debug</tt>. In the edge traffic layers of Wikimedia, this header is used as a signal to bypass the CDN cache. The request is then forwarded, past the load balancers, directly to the specified mw-debug server.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><div class="remarkup-code-header">Header Format</div><pre class="remarkup-code">X-Wikimedia-Debug: backend=&lt;servername&gt; [ ; log ] [ ; profile ] [ ; forceprofile ] [ ; readonly ]</pre></div>



<h6 class="remarkup-header">mediawiki-config</h6>

<p>This HTTP header is parsed by our MediaWiki configuration (<a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/aaa3f017ef8e5212ae337f528b1602a8cdc0c6c5/wmf-config/profiler.php#10" class="remarkup-link remarkup-link-ext" rel="noreferrer">wmf/profiler.php</a>, and <a href="https://gerrit.wikimedia.org/r/plugins/gitiles/operations/mediawiki-config/+/67c0f4d005dd1ec47d34aa7133c43a6ade4794f6/wmf-config/logging.php#61" class="remarkup-link remarkup-link-ext" rel="noreferrer">wmf/logging.php</a>).</p>

<p>For example, when <tt class="remarkup-monospaced">profile</tt> is set (the XHGui option), profiler.php invokes Tideways to start collecting stack traces with CPU/memory information. It then schedules a shutdown callback in which it gathers this data, connects to the XHGui database, and inserts a new record. The record can then be viewed via <a href="https://performance.wikimedia.org/xhgui/" class="remarkup-link remarkup-link-ext" rel="noreferrer">performance.wikimedia.org</a>.</p>

<h4 class="remarkup-header">See also</h4>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="https://wikitech.wikimedia.org/wiki/WikimediaDebug" class="remarkup-link remarkup-link-ext" rel="noreferrer">Documentation: How to use WikimediaDebug</a>.</li>
<li class="remarkup-list-item"><a href="https://gerrit.wikimedia.org/g/performance/WikimediaDebug/+/HEAD/CHANGELOG.md" class="remarkup-link remarkup-link-ext" rel="noreferrer">Change log: WikimediaDebug releases</a>.</li>
<li class="remarkup-list-item"><a href="https://github.com/perftools/xhgui" class="remarkup-link remarkup-link-ext" rel="noreferrer">GitHub: perftools/xhgui</a>.</li>
</ul>

<h5 class="remarkup-header">Further reading</h5>

<ul class="remarkup-list">
<li class="remarkup-list-item"><em><a href="https://phabricator.wikimedia.org/phame/post/view/143/debugging_production_with_x-wikimedia-debug/" class="remarkup-link" rel="noreferrer">How to debug in production with X-Wikimedia-Debug</a></em> (Feb 2019), by Kosta Harlan.</li>
<li class="remarkup-list-item"><em><a href="https://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg88740.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">Inline callstack profiling - Now also on Beta Cluster</a></em> (Jan 2018), by Timo Tijhof.</li>
<li class="remarkup-list-item"><em><a href="https://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg84490.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">Introducing X-Wikimedia-Debug, your new secret side-kick</a></em> (Mar 2016) by Ori Livneh.</li>
</ul>

<hr class="remarkup-hr" />

<p><span class="remarkup-nav-sequence"><a href="https://addons.mozilla.org/en-US/firefox/addon/wikimedia-debug-header/" class="phui-tag-view phui-tag-type-shade phui-tag-grey phui-tag-shade phui-tag-icon-view " target="_blank" rel="noreferrer"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-plus" data-meta="0_2" aria-hidden="true"></span>Add WikimediaDebug to Firefox</span></a></span>   <span class="remarkup-nav-sequence"><a href="https://chrome.google.com/webstore/detail/wikimediadebug/binmakecefompkjggiklgjenddjoifbb" class="phui-tag-view phui-tag-type-shade phui-tag-grey phui-tag-shade phui-tag-icon-view " target="_blank" rel="noreferrer"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-plus" data-meta="0_3" aria-hidden="true"></span>Add WikimediaDebug to Chrome</span></a></span></p></div></content></entry><entry><title>Wikipedia&#039;s JavaScript initialisation on a budget</title><link href="/phame/live/7/post/175/wikipedia_s_javascript_initialisation_on_a_budget/" /><id>https://phabricator.wikimedia.org/phame/post/view/175/</id><author><name>Krinkle (Timo Tijhof)</name></author><published>2019-09-18T22:46:41+00:00</published><updated>2022-05-13T09:36:05+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>This week saw the conclusion of a project that I&#039;ve been shepherding on and off since September of last year. The goal was for the initialisation of our asynchronous JavaScript pipeline (at the time, 36 kilobytes in size) to fit within a budget of 28 KB – the size of two 14 KB bursts of Internet packets.</p>

<p>In total, the year-long effort is saving 4.3 Terabytes a day of data bandwidth for our users&#039; page views.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/t64rdovzxdi7lzxxxkh3/PHID-FILE-lwaqoruegcizfxzsza5m/wmchart.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_14"><img src="https://phab.wmfusercontent.org/file/data/t64rdovzxdi7lzxxxkh3/PHID-FILE-lwaqoruegcizfxzsza5m/wmchart.png" height="381" alt="wmchart.png (764×1 px, 31 KB)" /></a></div></p>

<p>The above graph shows the transfer size over time. Sizes are after compression (i.e. the net bandwidth cost as perceived from a browser).</p>

<hr class="remarkup-hr" />

<h5 class="remarkup-header">How we did it</h5>

<p>The startup manifest is a difficult payload to optimise. The vast majority of its code isn&#039;t functional logic that can be optimised by traditional means. Rather, it is almost entirely made of pure data. The data is auto-generated by ResourceLoader and represents the registry of module bundles. (<a href="https://www.mediawiki.org/wiki/ResourceLoader/Architecture" class="remarkup-link remarkup-link-ext" rel="noreferrer">ResourceLoader</a> is the delivery system Wikipedia uses for its JavaScript, CSS, and interface text.)</p>

<p>This registry contains the metadata for all front-end features deployed on Wikipedia. It enumerates their name, currently deployed version, and their dependency relationships to other such bundles of loadable code.</p>
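<p>To make that concrete, the registry data in the startup module looks roughly like the following. This is a simplified sketch, not the exact generated format, and the module versions shown are invented examples:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Simplified sketch of the registry data in the startup module
// (module names and version strings here are examples; the real
// format is generated by ResourceLoader).
mw.loader.register( [
  [ 'jquery', '1a2b3', [] ],
  [ 'mediawiki.util', '4c5d6', [ 0 ] ],        // depends on jquery
  [ 'ext.example.feature', '7e8f9', [ 0, 1 ] ] // depends on both
] );</pre></div>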

<p>I started by identifying code that was never used in practice (<a href="https://phabricator.wikimedia.org/T202154" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_17"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T202154</span></span></a>). This included picking up unfinished or forgotten software deprecations, and removing unused compatibility code for browsers that no longer passed our <a href="https://www.mediawiki.org/wiki/Compatibility#Browsers" class="remarkup-link remarkup-link-ext" rel="noreferrer">Grade A</a> feature-test. I also wrote a <a href="https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Page_load_performance" class="remarkup-link remarkup-link-ext" rel="noreferrer">document about Page load performance</a>. This document serves as reference material, enabling developers to understand the impact of various types of changes on one or more stages of the page load process.</p>

<h5 class="remarkup-header">Fewer modules</h5>

<p>Next, I collaborated with the engineering teams here at the Wikimedia Foundation and at Wikimedia Deutschland to identify features that were using more modules than necessary, for example by bundling together parts of the same feature that are generally downloaded together. This leaves fewer entry points that need metadata in the ResourceLoader registry.</p>

<p>Some highlights:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">WMF Editing team: The WikiEditor extension now has 11 fewer modules. Another 31 modules were removed in UploadWizard. Thanks Ed Sanders, Bartosz Dziewoński, and James Forrester.</li>
<li class="remarkup-list-item">WMF Language team: Combined 24 modules of the ContentTranslation software. Thanks Santhosh Thottingal.</li>
<li class="remarkup-list-item">WMF Reading Web: Combined 25 modules in MobileFrontend. Thanks Stephen Niedzielski, and Jon Robson.</li>
<li class="remarkup-list-item">WMDE Community Wishlist Team: Removed 20 modules from the RevisionSlider and TwoColConflict features. Thanks Rosalie Perside, Jakob Warkotsch, and Amir Sarabadani.</li>
</ul>

<p>Last but not least, there was the Wikidata client for Wikipedia. This was an epic journey of its own (<a href="https://phabricator.wikimedia.org/T203696" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_18"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T203696</span></span></a>). This feature started out with a whopping 248 distinct modules registered on Wikipedia page views. The magnificent efforts of WMDE <strong>removed over 200 modules</strong>, bringing it down to 42 today.</p>

<p>The bar chart above shows small improvements throughout the year, all moving us closer to the goal. Two major drops stand out in particular. One is around two-thirds of the way, in the first week of August. This is when the aforementioned Wikidata improvement was deployed. The second drop is toward the end of the chart and happened this week – more about that below.</p>

<hr class="remarkup-hr" />

<h5 class="remarkup-header">Less metadata</h5>

<p>This week&#039;s improvement was achieved by two holistic changes that organised the data in a smarter way overall.</p>

<p>First – The <a href="https://www.mediawiki.org/wiki/Extension:EventLogging" class="remarkup-link remarkup-link-ext" rel="noreferrer">EventLogging</a> extension previously shipped its schema metadata as part of the startup manifest. Roan Kattouw (Growth Team) refactored this mechanism to instead bundle the schema metadata together with the JavaScript code of the EventLogging client. This means the startup footprint of EventLogging was reduced by over 90%. That&#039;s 2KB less metadata in the critical path! It also means that going forward, the startup cost for EventLogging no longer grows with each new event instrumentation. This clever bundling is powered by ResourceLoader&#039;s new <a href="https://www.mediawiki.org/wiki/ResourceLoader/Package_modules" class="remarkup-link remarkup-link-ext" rel="noreferrer">Package files</a> feature. This feature was expedited in February 2019 in part because of its potential to reduce the number of modules in our registry. Package files make it super easy to combine generated data with JavaScript code in a single module bundle.</p>

<p>Second – We shrank the average size of each entry in the registry overall (<a href="https://phabricator.wikimedia.org/T229245" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_19"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T229245</span></span></a>). The startup manifest contains two pieces of data for each module: its name, and its version ID. This version ID previously required 7 bytes of data. After thinking through the <a href="https://en.wikipedia.org/wiki/Birthday_problem" class="remarkup-link remarkup-link-ext" rel="noreferrer">Birthday mathematics problem</a> in the context of ResourceLoader, we decided that the probability spectrum for our version IDs can be safely reduced from 78 billion down to &quot;only&quot; 60 million. For more details see <a href="https://github.com/wikimedia/mediawiki/commit/9f516f1d3b6ab6a4f1bb7e385c93e4d9bccb46d7#diff-57e85f8b8063990fa5b0e2d2f0d25f8e" class="remarkup-link remarkup-link-ext" rel="noreferrer">the code comments</a>, but in summary it means we&#039;re saving 2 bytes for each of the 1100 modules still in the registry, thus reducing the payload by another 2-3 KB.</p>
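<p>As a small illustration of the arithmetic involved (the deployment count in the collision example is a made-up number, and the real analysis lives in the linked code comments):</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Illustrative arithmetic only; the real analysis is in the linked commit.
// Version IDs are short base-36 strings, so the ID space shrinks like this:
var oldSpace = Math.pow( 36, 7 ); // ~78 billion possible 7-character IDs
var newSpace = Math.pow( 36, 5 ); // ~60 million possible 5-character IDs

// Dropping 2 characters per version ID across the remaining modules:
var savedBytes = 1100 * 2; // ~2.2 KB off the uncompressed manifest

// Standard birthday-problem approximation: the chance that any two of
// k distinct versions of a module ever share an ID (k is a made-up example).
function collisionChance( k, space ) {
  return 1 - Math.exp( -k * ( k - 1 ) / ( 2 * space ) );
}
collisionChance( 1000, newSpace ); // roughly 0.8% after 1000 versions</pre></div>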

<p>Below is a close-up for the last few days (this is from synthetic monitoring, plotting the raw/uncompressed size):</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/roxbunjvadg5paakhayk/PHID-FILE-75m4l6c7okba4astrjnk/grafana-wmf22.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_15"><img src="https://phab.wmfusercontent.org/file/data/roxbunjvadg5paakhayk/PHID-FILE-75m4l6c7okba4astrjnk/grafana-wmf22.png" height="237" alt="grafana-wmf22.png (948×2 px, 88 KB)" /></a></div></p>

<p>The change was detected in ResourceLoader&#039;s synthetic monitoring. The above is captured from the <a href="https://grafana.wikimedia.org/d/BvWJlaDWk/startup-manifest-size?orgId=1&amp;from=1568439360000&amp;to=1568680200000" class="remarkup-link remarkup-link-ext" rel="noreferrer">Startup manifest size dashboard</a> on our public Grafana instance, showing a <span class="visual-only phui-icon-view phui-font-fa fa-arrow-circle-down green" data-meta="0_20" aria-hidden="true"></span> <strong>2.8KB</strong> decrease in the uncompressed data stream.</p>

<p>With this week&#039;s deployment, we&#039;ve completed the goal of shrinking the startup manifest to under 28 KB. This cross-departmental and cross-organisational project reduced the startup manifest by <span class="visual-only phui-icon-view phui-font-fa fa-arrow-circle-down green" data-meta="0_21" aria-hidden="true"></span> <strong>9 KB</strong> overall (net bandwidth, after compression): from 36.2 kilobytes one year ago, down to 27.2 KB today.</p>

<p>We have around 363,000 page views a minute in total on Wikipedia and sister projects. That&#039;s 21.8M an hour, or 523 million every day (<a href="https://stats.wikimedia.org/v2/#/all-projects/reading/total-page-views/normal|bar|2-year|agent~user|monthly" class="remarkup-link remarkup-link-ext" rel="noreferrer">User pageview stats</a>). This week&#039;s deployment saves around 1.4 Terabytes a day. In total, the year-long effort is saving 4.3 Terabytes a day of bandwidth on our users&#039; page views.</p>
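<p>For reference, a back-of-the-envelope check of those figures, using only the numbers quoted above (actual savings depend on caching and compression):</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Back-of-the-envelope check of the figures quoted above.
var viewsPerMinute = 363000;
var viewsPerDay = viewsPerMinute * 60 * 24; // ~523 million

var savedPerViewKB = 9; // compressed savings from the whole year-long effort
var savedPerDayTB = ( viewsPerDay * savedPerViewKB ) / 1e9;
// ~4.7 TB/day as a naive upper bound; the quoted ~4.3 TB/day is lower,
// presumably because caching means not every view fetches a fresh manifest.</pre></div>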

<hr class="remarkup-hr" />

<h5 class="remarkup-header">What&#039;s next</h5>

<p><div class="phabricator-remarkup-embed-layout-right phabricator-remarkup-embed-float-right"><a href="https://phab.wmfusercontent.org/file/data/kln5kmacx5ea26d2gpen/PHID-FILE-sz3amx35jppvpc4lvopq/grafana-wmf22-pie.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_16"><img src="https://phab.wmfusercontent.org/file/data/kln5kmacx5ea26d2gpen/PHID-FILE-sz3amx35jppvpc4lvopq/grafana-wmf22-pie.png" height="153" alt="grafana-wmf22-pie.png (918×1 px, 68 KB)" /></a></div></p>

<p>It&#039;s great to celebrate that Wikipedia&#039;s startup payload now neatly fits into the target budget of 28 KB – chosen as the lowest multiple of 14KB we can fit within subsequent <a href="https://tylercipriani.com/blog/2016/09/25/the-14kb-in-the-tcp-initial-window/" class="remarkup-link remarkup-link-ext" rel="noreferrer">bursts of Internet packets</a> to a web browser.</p>

<p>The challenge going forward will be to keep us there. Over the past year I&#039;ve kept a very close eye (<a href="https://docs.google.com/document/d/1SESOADAH9phJTeLo4lqipAjYUMaLpGsQTAUqdgyZb4U/edit" class="remarkup-link remarkup-link-ext" rel="noreferrer">spreadsheet</a>) on the startup manifest — to verify our progress, and to identify potential regressions. I&#039;ve since automated this laborious process through a public <a href="https://grafana.wikimedia.org/d/BvWJlaDWk/startup-manifest-size" class="remarkup-link remarkup-link-ext" rel="noreferrer">Grafana dashboard</a>.</p>

<p>We still have many more opportunities on that dashboard to improve bundling of our features, and (for Performance Team) to make it even easier to implement such bundling. I hope these ongoing improvements will come in handy whilst we work on finding room in our performance budget for upcoming features.</p>

<p>– Timo Tijhof</p>

<hr class="remarkup-hr" />

<p><em>Further reading:</em></p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="https://performance.wikimedia.org/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Metrics &amp; Perf reports</a>, on performance.wikimedia.org</li>
<li class="remarkup-list-item"><a href="https://www.mediawiki.org/wiki/ResourceLoader/Architecture" class="remarkup-link remarkup-link-ext" rel="noreferrer">ResourceLoader Architecture</a>, on mediawiki.org.</li>
<li class="remarkup-list-item"><a href="https://tylercipriani.com/blog/2016/09/25/the-14kb-in-the-tcp-initial-window/" class="remarkup-link remarkup-link-ext" rel="noreferrer">The 14KB Initial Window</a>, by Tyler Cipriani</li>
</ul></div></content></entry><entry><title>Tracking down slow event handlers with Event Timing</title><link href="/phame/live/7/post/168/tracking_down_slow_event_handlers_with_event_timing/" /><id>https://phabricator.wikimedia.org/phame/post/view/168/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2019-06-19T15:17:49+00:00</published><updated>2019-06-20T17:57:51+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>We&#039;re taking part in the ongoing <a href="https://github.com/WICG/event-timing" class="remarkup-link remarkup-link-ext" rel="noreferrer">Event Timing</a> <a href="https://www.chromium.org/blink/origin-trials" class="remarkup-link remarkup-link-ext" rel="noreferrer">Chrome origin trial</a>, in order to experiment with that API early and give feedback to its designers. The goal of this upcoming API is to surface slow events. This is an area of web performance that hasn&#039;t gotten a lot of attention before, but one that can be very frustrating for users. Essentially, when slow events occur, users are trying to interact with the page and it&#039;s being unresponsive. Not a desirable user experience.</p>

<h2 class="remarkup-header">Slow event handlers</h2>

<p>Two phases of an event&#039;s lifecycle can take too long: its queueing time and its event handler time. When queueing time is long, it&#039;s an indication that the browser is busy with something else. Most likely, things that the <a href="https://developer.mozilla.org/en-US/docs/Web/API/Long_Tasks_API" class="remarkup-link remarkup-link-ext" rel="noreferrer">Long Tasks API</a> would capture.</p>

<p>What we focused on in our trial were events whose handlers were slow. Since Wikipedia doesn&#039;t run any 3rd-party code, if an event handler is slow, it&#039;s <em>our fault</em>. And hopefully we can do something about it.</p>

<p>In order to determine whether slow events are happening on our page, we set up a PerformanceObserver listening to Event Timing entries:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="kd">function</span> <span class="nx">setupEventTimingObserver</span><span class="p">()</span> <span class="p">{</span>
  <span class="kd">var</span> <span class="nx">observer</span><span class="p">;</span>

  <span class="k">if</span> <span class="p">(</span> <span class="o">!</span><span class="nb">window</span><span class="p">.</span><span class="nx">PerformanceObserver</span> <span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="nx">observer</span> <span class="o">=</span> <span class="k">new</span> <span class="nx">PerformanceObserver</span><span class="p">(</span> <span class="nx">observeEventTiming</span> <span class="p">);</span>

  <span class="k">try</span> <span class="p">{</span>
    <span class="nx">observer</span><span class="p">.</span><span class="nx">observe</span><span class="p">(</span> <span class="p">{</span> <span class="nx">entryTypes</span><span class="o">:</span> <span class="p">[</span> <span class="s1">&#39;event&#39;</span> <span class="p">],</span> <span class="nx">buffered</span><span class="o">:</span> <span class="kc">true</span> <span class="p">}</span> <span class="p">);</span>
  <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span> <span class="nx">e</span> <span class="p">)</span> <span class="p">{</span>
    <span class="c1">// If EventTiming isn&#39;t available, this errors because we try subscribing to an invalid entryType</span>
  <span class="p">}</span>
<span class="p">}</span></pre></div>

<p>Then, in the entries this observer collects, we&#039;re interested in these properties:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="c1">// The time the first event handler started to execute.</span>
<span class="c1">// |startTime| if no event handlers executed.</span>
<span class="nx">readonly</span> <span class="nx">attribute</span> <span class="nx">DOMHighResTimeStamp</span> <span class="nx">processingStart</span><span class="p">;</span>
<span class="c1">// The time the last event handler finished executing.</span>
<span class="c1">// |startTime| if no event handlers executed.</span>
<span class="nx">readonly</span> <span class="nx">attribute</span> <span class="nx">DOMHighResTimeStamp</span> <span class="nx">processingEnd</span><span class="p">;</span></pre></div>

<p>The event handler duration is simply the delta between these PerformanceEventTiming object properties (processingEnd - processingStart).</p>
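<p>The <tt class="remarkup-monospaced">observeEventTiming</tt> callback registered earlier can compute that delta for each entry and keep only the slow ones. A minimal sketch, in which the 50ms threshold and the reporting helper are illustrative:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Minimal sketch of the observer callback registered above: keep entries
// whose event handlers ran for more than 50ms in total.
function observeEventTiming( list ) {
  list.getEntries().forEach( function ( entry ) {
    var handlerDuration = entry.processingEnd - entry.processingStart;
    if ( handlerDuration &gt; 50 ) {
      // entry.name is the event type, e.g. 'click' or 'mousemove'.
      reportSlowEvent( entry.name, entry.startTime, handlerDuration );
    }
  } );
}

// Placeholder for whatever reporting mechanism is used.
function reportSlowEvent( type, startTime, duration ) {
  console.log( type, startTime, duration );
}</pre></div>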

<p>Great! Now we know when event handlers take a while to run, and we know the event type (eg. click, mousemove, etc.). But how can we figure out which part of our UX these events came from?</p>

<h2 class="remarkup-header">Cross-referencing with regular events</h2>

<p>Our workaround for only knowing an event&#039;s type and timing information is to also listen to events of interest (in our case, clicks) on the whole document. If you capture all events, you&#039;re bound to run into the slow ones...</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="nx">$</span><span class="p">(</span> <span class="nb">document</span> <span class="p">).</span><span class="nx">on</span><span class="p">(</span> <span class="s1">&#39;click&#39;</span><span class="p">,</span> <span class="kd">function</span> <span class="nx">listener</span><span class="p">(</span> <span class="nx">e</span> <span class="p">)</span> <span class="p">{</span>
  <span class="c1">// do something with the event</span>
<span class="p">}</span></pre></div>

<p>How can we cross-reference them? Well, conveniently, an event&#039;s timeStamp property is identical to the corresponding PerformanceEventTiming startTime property. By cross-referencing types and timestamps, we can figure out which events in the document were slow. And we can get actionable information, such as the event&#039;s target.</p>

<p>Now we&#039;re all set: we can collect events with slow handlers and figure out which user interaction they came from. By walking up the DOM tree from the target, we can figure out exactly which element users interacted with that triggered a slow event handler.</p>
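<p>Putting it together, the cross-referencing could look roughly like the sketch below. The helper functions and the id-based heuristic are invented for illustration, not our production code:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Rough sketch: remember the target of every click, keyed by timestamp,
// so slow Event Timing entries can be mapped back to a DOM element.
var clickTargets = {};

document.addEventListener( 'click', function ( e ) {
  clickTargets[ e.timeStamp ] = e.target;
}, true ); // capture phase, so clicks anywhere in the document are seen

// Inside the observer callback, look up the matching target:
function findClickTarget( entry ) {
  return entry.name === 'click' ? clickTargets[ entry.startTime ] : null;
}

// Walking up from the target identifies the UI component, for example the
// nearest ancestor with an id attribute (an illustrative heuristic only).
function describeTarget( target ) {
  var node = target;
  while ( node &amp;&amp; !node.id ) {
    node = node.parentNode;
  }
  return node ? node.id : 'unknown';
}</pre></div>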

<h2 class="remarkup-header">What we&#039;ve found</h2>

<p>Using this technique and deploying it to production on 2 Wikipedias (Russian and Spanish), we quickly identified 3 slow click handlers frequently experienced by real users on Wikipedia. These take more than 50ms, thousands of times per day, for users of those wikis:</p>

<p><a href="/T226023" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_22"><span class="phui-tag-core phui-tag-color-object">T226023: Media Viewer detach/shutdown can be expensive</span></a><br />
<a href="/T226025" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_23"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T226025: Expensive viewport size access in Reference Drawers</span></span></a><br />
<a href="/T225946" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_24"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T225946: [SPIKE 8hrs] Determine remedy for MobileFrontend lazy-loading images performance issues</span></span></a></p>

<p>Two of those issues are caused by expensive JavaScript calls that trigger style recalculation and layout. This is a common performance pitfall, because those calls look quite innocent. Paul Irish has put together <a href="https://gist.github.com/paulirish/5d52fb081b3570c81e3a" class="remarkup-link remarkup-link-ext" rel="noreferrer">this very handy list of JS features</a> that trigger that problem.</p>
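<p>As a generic illustration of the pattern (not the actual code behind the tasks above): reading a layout-dependent property right after writing styles forces a synchronous style recalculation and layout, whereas deferring the work to CSS avoids it.</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Generic illustration of the pitfall (not the actual code from the tasks).

// Innocent-looking, but slow: the style write invalidates layout, and the
// read right after forces the browser to recalculate it synchronously.
function slowHandler( drawer ) {
  drawer.style.height = 'auto';
  var height = drawer.offsetHeight; // forced synchronous layout
  drawer.style.height = height + 'px';
}

// Cheaper: batch reads before writes, or let CSS handle it entirely so
// no layout read is needed in JavaScript at all.
function fasterHandler( drawer ) {
  drawer.classList.add( 'drawer-open' ); // height transition defined in CSS
}</pre></div>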

<p>Hopefully we can replace the offending code with CSS-only solutions, or at the very least reduce its reliance on these expensive calls, so that users always get a responsive experience when they click on those UI elements.</p>

<p>Beyond these top 3 issues, the Event Timing API is surfacing a very long tail of small performance problems in corners of our UX that are worth improving. It&#039;s shedding light on a lot of different potential sources of user frustration.</p>

<h2 class="remarkup-header">Feedback</h2>

<p>By doing this work, it became self-evident that having the event target directly in the Event Timing API would be very convenient. It would let us avoid the overhead of listening to all events that might be slow and remove the cross-referencing effort. This is why we joined <a href="https://github.com/WICG/event-timing/issues/9" class="remarkup-link remarkup-link-ext" rel="noreferrer">Nic Jansma&#039;s request to have more context in the API</a>. This is precisely what origin trials are for, and we are glad to have been able to express our operational needs early, which should hopefully contribute to the final design of that new browser API.</p>

<p>The Event Timing API origin trial <a href="https://developers.chrome.com/origintrials/#/view_trial/-1621295865853378559" class="remarkup-link remarkup-link-ext" rel="noreferrer">runs until July 24 on Chrome 68-75</a>, so you can already give it a spin in production and see if you find slow events on your own site!</p></div></content></entry><entry><title>Performance perception: correlation to RUM metrics</title><link href="/phame/live/7/post/167/performance_perception_correlation_to_rum_metrics/" /><id>https://phabricator.wikimedia.org/phame/post/view/167/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2019-06-17T11:56:27+00:00</published><updated>2019-06-18T10:12:24+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>When we set out to ask Wikipedia visitors their opinion of page load performance, our main hope was to answer an age-old question: which RUM metric matters the most to users? And more interestingly, which ones matter the most to our users on our content.</p>

<p>Now that we have a lot of user input with our micro survey running for over a year, we can look at which classic RUM metrics correlate the best to users&#039; perception of the page load performance.</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/v2ltezjennyw5ewt4g2c/PHID-FILE-jjzmc37rcos3fixkqbtr/800px-Barack_Obama_through_a_magnifying_glass.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_25"><img src="https://phab.wmfusercontent.org/file/data/v2ltezjennyw5ewt4g2c/PHID-FILE-jjzmc37rcos3fixkqbtr/800px-Barack_Obama_through_a_magnifying_glass.jpg" height="533" width="800" loading="lazy" alt="800px-Barack_Obama_through_a_magnifying_glass.jpg (533×800 px, 90 KB)" /></a></div></p>

<h2 class="remarkup-header">Methodology</h2>

<p>We collect user responses from <a href="https://phabricator.wikimedia.org/phame/post/view/161/performance_perception_how_satisfied_are_wikipedia_users/" class="remarkup-link" rel="noreferrer">an in-page micro survey asking them if the page load was fast enough</a>. We map their responses to 1 for positive answers, -1 for negative answers and we discard neutral &quot;I don&#039;t know&quot; answers. We only look at records where a given RUM metric is present, and for time-based metrics, only if the value is lower than 30 seconds. Beyond that point we know for certain that the experience was terrible or that there was an issue with metric collection.</p>
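<p>For reference, the coefficients in the next section are computed from pairs of metric values and mapped responses using the standard formula. A minimal sketch of that computation (the actual analysis was done offline on the collected data):</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Minimal Pearson correlation between metric values (xs) and the
// mapped survey responses (ys: 1 for positive, -1 for negative).
function pearson( xs, ys ) {
  var n = xs.length;
  var meanX = xs.reduce( function ( a, b ) { return a + b; }, 0 ) / n;
  var meanY = ys.reduce( function ( a, b ) { return a + b; }, 0 ) / n;
  var cov = 0, varX = 0, varY = 0;
  for ( var i = 0; i &lt; n; i++ ) {
    cov += ( xs[ i ] - meanX ) * ( ys[ i ] - meanY );
    varX += Math.pow( xs[ i ] - meanX, 2 );
    varY += Math.pow( ys[ i ] - meanY, 2 );
  }
  return cov / Math.sqrt( varX * varY );
}</pre></div>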

<h2 class="remarkup-header">Results</h2>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><strong>Metric</strong></td><td><strong><a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pearson coefficient</a></strong></td><td><strong>Sample size</strong></td></tr>
<tr><td>domInteractive</td><td>-0.149</td><td>1,709,435</td></tr>
<tr><td>firstContentfulPaint</td><td>-0.144</td><td>985,662</td></tr>
<tr><td>firstPaint</td><td>-0.143</td><td>1,057,157</td></tr>
<tr><td>domComplete</td><td>-0.142</td><td>1,703,476</td></tr>
<tr><td>loadEventEnd</td><td>-0.142</td><td>1,703,441</td></tr>
<tr><td>loadEventStart</td><td>-0.142</td><td>1,703,388</td></tr>
<tr><td>top thumbnail (<a href="https://github.com/WICG/element-timing" class="remarkup-link remarkup-link-ext" rel="noreferrer">Element Timing for Images</a> origin trial)</td><td>-0.138</td><td>28,070</td></tr>
<tr><td>responseStart</td><td>-0.131</td><td>1,705,859</td></tr>
<tr><td>RUMSpeedIndex</td><td>-0.129</td><td>1,319,177</td></tr>
<tr><td>secureConnectionStart</td><td>-0.128</td><td>942,602</td></tr>
<tr><td>requestStart</td><td>-0.120</td><td>1,694,865</td></tr>
<tr><td>connectEnd</td><td>-0.119</td><td>1,596,297</td></tr>
<tr><td>redirecting</td><td>-0.109</td><td>33,056</td></tr>
<tr><td>domainLookupEnd - domainLookupStart</td><td>-0.0965</td><td>670,932</td></tr>
<tr><td>connectStart</td><td>-0.096</td><td>1,595,544</td></tr>
<tr><td>netinfoEffectiveConnectionType</td><td>0.0845</td><td>1,301,435</td></tr>
<tr><td>deviceMemory</td><td>0.0663</td><td>1,286,609</td></tr>
<tr><td>fetchStart</td><td>-0.0521</td><td>1,444,012</td></tr>
<tr><td>unloadEventEnd - unloadEventStart</td><td>-0.03089</td><td>29,854</td></tr>
<tr><td>cpu benchmark score</td><td>-0.00615</td><td>1,696,239</td></tr>
<tr><td>transferSize</td><td>-0.00208</td><td>1,358,990</td></tr>
<tr></tr>
</table></div>

<p>Pearson correlation coefficients range from 1 to -1, meaning that even our &quot;best&quot; correlations are actually the<em> least terrible ones</em>. Overall RUM metric correlation is quite poor and an indication that they only represent a small part of what constitutes the perceived performance of a page load.</p>

<h2 class="remarkup-header">Analysis</h2>

<p>There is a clear pattern of environmental properties having the worst correlation: effective connection type, device memory, available CPU, and page transfer size. This might suggest that users are aware of their device, network quality and page size (small vs big article in Wikipedia&#039;s case) and adjust their expectations to those factors.</p>

<p>As for actual RUM metrics, it&#039;s interesting to see that the top ones are not just the paint metrics, but also domInteractive. The reason they are so close to each other is probably that, in Wikipedia&#039;s case, these metrics land very close together in general, due to the absence of 3rd-party assets on our pages.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>Thanks to this real-world opinion data, we can make a better educated guess about which RUM metric(s) matter the most to us. It also shows how sub-par existing RUM metrics are in general. We encourage the development of new metrics that capture other aspects of performance than the initial page load/render, as this part seems well covered already, with seemingly very little difference in terms of correlation to perceived performance between them, at least in our case.</p>

<p>The performance perception micro survey will keep running and will allow us to benchmark future APIs. Which we intend to do with our ongoing <a href="https://github.com/WICG/layout-instability/blob/master/README.md" class="remarkup-link remarkup-link-ext" rel="noreferrer">Layout Instability API</a> origin trial, for example, once <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=958795#c_ts1556814267" class="remarkup-link remarkup-link-ext" rel="noreferrer">the</a> <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=958832#c4" class="remarkup-link remarkup-link-ext" rel="noreferrer">fixes</a> of the bugs we discovered during the trial have been rolled out.</p></div></content></entry><entry><title>Performance perception: the effect of late-loading banners</title><link href="/phame/live/7/post/165/performance_perception_the_effect_of_late-loading_banners/" /><id>https://phabricator.wikimedia.org/phame/post/view/165/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2019-06-13T14:12:40+00:00</published><updated>2019-09-21T01:42:51+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Unlike most websites, Wikipedia and its sister projects are ad-free. This is actually one of the reasons why our performance is so good. We don&#039;t have to deal with slow and invasive third-parties.</p>

<p>However, while we don&#039;t have ads, we do display announcement and fundraising banners frequently at the top of wikis. Here&#039;s an example:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/h7rltcibzloqaefilauv/PHID-FILE-ybtpbiyemdiopfnlbubp/foo.gif" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_26"><img src="https://phab.wmfusercontent.org/file/data/h7rltcibzloqaefilauv/PHID-FILE-ybtpbiyemdiopfnlbubp/foo.gif" height="260" width="1126" loading="lazy" alt="foo.gif (260×1 px, 154 KB)" /></a></div></p>

<p>Those are driven by JS and as a result always appear after the initial page render. Worse, they push down content when they appear. This is a long-standing technical debt issue that we hope to tackle one day, and one of the most obvious issues we deal with that may impact performance perception. How big is the impact? With <a href="https://phabricator.wikimedia.org/phame/post/view/161/performance_perception_how_satisfied_are_wikipedia_users/" class="remarkup-link" rel="noreferrer">our performance perception micro survey</a> asking our visitors about page performance, we can finally find out.</p>

<h2 class="remarkup-header">Perception distribution</h2>

<p>We can look at the distribution (Y axis) of positive and negative survey answers based on when the banner was injected into the DOM, in milliseconds (X axis).</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/sthim7y5wh3qyc7xphks/PHID-FILE-dlgpcagzm3kyynjnindb/chart_%282%29.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_27"><img src="https://phab.wmfusercontent.org/file/data/sthim7y5wh3qyc7xphks/PHID-FILE-dlgpcagzm3kyynjnindb/chart_%282%29.png" height="425" width="688" loading="lazy" alt="chart (2).png (425×688 px, 21 KB)" /></a></div></p>

<p>We see the obvious pattern that positive answers to the micro-survey question (did this page load fast enough?) are more likely if the banner appeared quickly. However, by looking at the data globally like this, we can&#039;t separate the banner&#039;s slowness from the page&#039;s. After all, if your internet connection and device are slow, both the page itself and the banner will be slow, and users might be responding based on the page, ignoring the banner. This distribution might be nearly identical to one based on page load time alone, regardless of whether a banner is present.</p>

<h2 class="remarkup-header">Banner vs no banner</h2>

<p>A simple way to look at this problem is to check the ratio of micro-survey responses for pageviews where a banner was present vs pageviews where there was no banner. Banner campaigns tend to run for specific periods, targeting certain geographies, meaning that a lot of visits don&#039;t have a banner displayed at all. Both sample sizes should be large enough to draw conclusions.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><strong>Corpus</strong></td><td><strong>User satisfaction ratio</strong></td><td><strong>Sample size</strong></td></tr>
<tr><td>No banner or answered before banner</td><td>86.64%</td><td>1,111,542</td></tr>
<tr><td>Banner and answered after banner</td><td>87.8%</td><td>311,332</td></tr>
<tr></tr>
</table></div>

<p><em>For the banner case, we didn&#039;t collect whether the banner was in the user&#039;s viewport (i.e. was it seen?).</em></p>

<p>What is going on? It would seem that users are slightly more or equally satisfied with the page performance when a banner is injected. It would suggest that our late-loading banners aren&#039;t affecting page performance perception. This sounds too good to be true. We&#039;re probably looking at data too globally, including all outliers. One of our team&#039;s best practices when findings that are too good to be true appear is to keep digging to try to disprove them. Let&#039;s zoom in on more specific data.</p>

<h2 class="remarkup-header">Slow vs fast banners</h2>

<p>Let&#039;s look at &quot;fast&quot; pageloads, where <strong>loadEventEnd is under a second</strong>. That event happens when the whole page has fully loaded, including all the images.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><strong>Corpus</strong></td><td><strong>User satisfaction ratio</strong></td><td><strong>Sample size</strong></td></tr>
<tr><td>Banner injected into DOM before loadEventEnd</td><td>92.66%</td><td>4,761</td></tr>
<tr><td>Banner injected into DOM less than 500ms after loadEventEnd</td><td>92.03%</td><td>67,588</td></tr>
<tr><td>Banner injected into DOM between 2 and 5 seconds after loadEventEnd</td><td>85.33%</td><td>859</td></tr>
<tr></tr>
</table></div>

<p>We can see that the effect on user performance satisfaction starts being quite dramatic as soon as the banner is really late compared to the speed of the main page load.</p>

<p>What if the main pageload is slow? Are users more tolerant of a banner that takes 2-5 seconds to appear? Let&#039;s look at &quot;slow&quot; pageloads, where <strong>loadEventEnd is between 5 and 10 seconds</strong>:</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><strong>Corpus</strong></td><td><strong>User satisfaction ratio</strong></td><td><strong>Sample size</strong></td></tr>
<tr><td>Banner injected into DOM before loadEventEnd</td><td>79.13%</td><td>3,019</td></tr>
<tr><td>Banner injected into DOM less than 500ms after loadEventEnd</td><td>78.45%</td><td>2,488</td></tr>
<tr><td>Banner injected into DOM between 2 and 5 seconds after loadEventEnd</td><td>76.17%</td><td>2,480</td></tr>
<tr></tr>
</table></div>

<p>While there is a loss of satisfaction, it&#039;s not as dramatic as for fast pages. This makes sense, as users experiencing slow page loads probably have a higher tolerance to slowness in general.</p>

<h2 class="remarkup-header">Slicing it further</h2>

<p>We&#039;ve established that even for a really slow pageload, the impact of a slow late-loading banner is already visible at 2-5 seconds. If it happens within 500ms after loadEventEnd, the impact isn&#039;t that big (less than 1% satisfaction drop). Let&#039;s look at the timespan after loadEventEnd in more detail for fast pageloads (&lt; 1s loadEventEnd) in order to find out where things start to really take a turn for the worse.</p>
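<p>As an aside, the &quot;time after loadEventEnd&quot; dimension itself can be measured with a small amount of client-side code. Below is a rough sketch, assuming banners land in the standard <tt class="remarkup-monospaced">#siteNotice</tt> container; the real instrumentation may differ:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Rough sketch of the measurement: how long after loadEventEnd a banner
// is injected, assuming banners land inside the #siteNotice container.
var container = document.getElementById( 'siteNotice' );

var bannerObserver = new MutationObserver( function () {
  var timing = performance.timing;
  var loadEventEnd = timing.loadEventEnd - timing.navigationStart;
  var delta = performance.now() - loadEventEnd; // ms after loadEventEnd
  bannerObserver.disconnect();
  console.log( 'Banner injected', delta, 'ms after loadEventEnd' );
} );

if ( container ) {
  bannerObserver.observe( container, { childList: true } );
}</pre></div>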

<p>Here&#039;s the user page performance satisfaction ratio, based on how long after loadEventEnd the banner was injected into the DOM:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/xhud2m7k5exdisty6smr/PHID-FILE-2gpmhvlgbv2oq24hv7s5/chart_%281%29.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_28"><img src="https://phab.wmfusercontent.org/file/data/xhud2m7k5exdisty6smr/PHID-FILE-2gpmhvlgbv2oq24hv7s5/chart_%281%29.png" height="423" width="684" loading="lazy" alt="chart (1).png (423×684 px, 11 KB)" /></a></div></p>

<h2 class="remarkup-header">Conclusion</h2>

<p>The reason why the issues caused by late-loading banners aren&#039;t apparent when looking at the data globally is probably that, most of the time, banners load fast. But when they happen after loadEventEnd, users start to be quite unforgiving, with the performance satisfaction ratio dropping rapidly. For users with an otherwise fast experience, we can&#039;t afford for banners to be injected more than 500ms after loadEventEnd if we want to maintain a 90% satisfaction ratio.</p>

<p>Of course, we would like to change our architecture so that banners are rendered server-side, which would get rid of the issue entirely. But in the meantime loadEventEnd + 500ms seems like a good performance budget we should aim for if we want to mitigate the user impact of our current architectural limitations.</p></div></content></entry><entry><title>Performance perception: how satisfied are Wikipedia users?</title><link href="/phame/live/7/post/161/performance_perception_how_satisfied_are_wikipedia_users/" /><id>https://phabricator.wikimedia.org/phame/post/view/161/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2019-05-29T17:17:24+00:00</published><updated>2019-06-07T19:47:01+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>We&#039;ve recently published <a href="https://nonsns.github.io/paper/rossi19www.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">research on performance perception</a> that we did last year. The micro survey used in this study is still running on multiple Wikipedia languages and gives us insights into perceived performance.</p>

<p>The micro survey simply asks users on Wikipedia articles, in their own language, if they think that the current page loaded fast enough:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/5grwxhtohaoc5v4jpgll/PHID-FILE-2e4d2pvggo7ujfqaelm5/Capture_d_e%CC%81cran_2018-04-30_10.09.02.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_29"><img src="https://phab.wmfusercontent.org/file/data/5grwxhtohaoc5v4jpgll/PHID-FILE-2e4d2pvggo7ujfqaelm5/Capture_d_e%CC%81cran_2018-04-30_10.09.02.png" height="478" width="483" loading="lazy" alt="Capture_d_écran_2018-04-30_10.09.02.png (478×483 px, 46 KB)" /></a></div></p>

<p>Let&#039;s look at the results on Spanish and Russian Wikipedias, where we&#039;re collecting the most data. We have collected more than 1.1 million survey responses on Spanish Wikipedia and close to 1 million on Russian Wikipedia so far. The survey is displayed to a small fraction of our visitors.</p>

<h2 class="remarkup-header">How satisfied are our visitors with our page load performance?</h2>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/krenxmalrgkvstwz2ut4/PHID-FILE-76j4csitzmafhou7s5ey/Capture_d_e%CC%81cran_2019-05-29_19.00.21.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_30"><img src="https://phab.wmfusercontent.org/file/data/krenxmalrgkvstwz2ut4/PHID-FILE-76j4csitzmafhou7s5ey/Capture_d_e%CC%81cran_2019-05-29_19.00.21.png" height="286" width="1066" loading="lazy" alt="Capture d&#039;écran 2019-05-29 19.00.21.png (286×1 px, 50 KB)" /></a></div></p>

<p>Ignoring neutral responses (&quot;I&#039;m not sure&quot;), we see that consistently across wikis <strong>between 85 and 90%</strong> of visitors find that the page loaded fast enough. That&#039;s an excellent score, one that we can be proud of. And it makes sense, considering that Wikipedia is one of the fastest websites on the Web.</p>
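
<p>For clarity, this is the satisfaction ratio calculation used throughout the post, as a trivial sketch with made-up field names:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Sketch of the satisfaction ratio: neutral ('I am not sure') answers are
// excluded from the denominator. The answer values are made up for this example.
function satisfactionRatio(responses) {
    const positive = responses.filter(function (r) { return r === 'fast-enough'; }).length;
    const negative = responses.filter(function (r) { return r === 'not-fast-enough'; }).length;
    return positive / (positive + negative);
}</pre></div>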

<p>Now, a very interesting finding is that this satisfaction ratio varies quite a bit depending on whether you&#039;re logged into the website, or if like most Wikipedia visitors, you&#039;re logged out:</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><strong>Wiki</strong></td><td><strong>Status</strong></td><td><strong>Sample size</strong></td><td><strong>Satisfaction ratio</strong></td></tr>
<tr><td>spanish</td><td>logged in</td><td>1,500</td><td>89.70%</td></tr>
<tr><td>spanish</td><td>logged out</td><td>1,109,205</td><td>85.82%</td></tr>
<tr><td>russian</td><td>logged in</td><td>7,093</td><td>92.28%</td></tr>
<tr><td>russian</td><td>logged out</td><td>885,926</td><td>85.82%</td></tr>
<tr></tr>
</table></div>

<p>It appears that logged-in users are consistently more satisfied with our performance than logged-out visitors.</p>

<h2 class="remarkup-header">The contributor performance penalty</h2>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/xfdc63jooffy27ddedl4/PHID-FILE-e26jtvv637iic4bwnvay/1280px-Editing_Wikipedia_at_Prima_Vista_writing_marathon.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_31"><img src="https://phab.wmfusercontent.org/file/data/xfdc63jooffy27ddedl4/PHID-FILE-e26jtvv637iic4bwnvay/1280px-Editing_Wikipedia_at_Prima_Vista_writing_marathon.jpg" height="851" width="1280" loading="lazy" alt="Andres Apevalov — Press team of Prima Vista Literature Festival, CC BY-SA 4.0" /></a></div></td></tr>
<tr><td><em>Andres Apevalov — Press team of Prima Vista Literature Festival, CC BY-SA 4.0</em></td></tr>
<tr></tr>
</table></div>

<p>What&#039;s very surprising about logged-in users being more satisfied is that we know for a fact that the logged-in experience is slower, because our logged-in users have to reach our master datacenter in the US instead of hitting the cache point of presence closest to them. This is a long-standing technical limitation of our architecture, and an issue <a href="https://phabricator.wikimedia.org/T106099" class="remarkup-link" rel="noreferrer">we intend to resolve</a> one day.</p>

<p>Why could they possibly be happier, then?</p>

<h2 class="remarkup-header">The Spanish paradox</h2>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/zlilpbemv5hwiprroesz/PHID-FILE-h67czkdrprudvhtn22lt/Map-Hispanophone_World.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_32"><img src="https://phab.wmfusercontent.org/file/data/zlilpbemv5hwiprroesz/PHID-FILE-h67czkdrprudvhtn22lt/Map-Hispanophone_World.png" height="628" width="1427" loading="lazy" alt="Map-Hispanophone_World.png (628×1 px, 32 KB)" /></a></div></p>

<p>Spanish Wikipedia, at first glance, seems to contradict this phenomenon of slower page loads for logged-in users. Looking at the desktop site only (to rule out differences in the mobile/desktop mix):</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><strong>Wiki</strong></td><td><strong>Status</strong></td><td><strong>Median loadEventEnd (ms)</strong></td></tr>
<tr><td>spanish</td><td>logged in</td><td>1400.5</td></tr>
<tr><td>spanish</td><td>logged out</td><td>1834</td></tr>
<tr><td>russian</td><td>logged in</td><td>1356</td></tr>
<tr><td>russian</td><td>logged out</td><td>1075</td></tr>
<tr></tr>
</table></div>

<p>The reason why - contrary to what we see on other wikis and at a global scale - Spanish Wikipedia page loads seem faster for logged-in users is that Spanish Wikipedia traffic has a very peculiar geographic distribution. Logged-in users are much more likely to be based in Spain (30.04%) than their logged-out counterparts (22.3%), who more often connect from Latin American countries. Since internet connectivity tends to be faster in Spain, this difference in ratios explains why the logged-in experience appears to be faster - but isn&#039;t - when looking at RUM data at the website level.</p>

<p>This is a very common pitfall of RUM data, where seemingly contradictory results can emerge depending on how you slice the data. RUM data has to be studied from many angles before drawing conclusions.</p>

<h2 class="remarkup-header">Caching differences</h2>

<p>Looking at the <a href="https://www.w3.org/TR/navigation-timing-2/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Navigation Timing</a> data we collect for survey respondents, we see that for logged-in users the median connect time on Spanish Wikipedia is 0, while for logged-out users it&#039;s 144ms. This means that logged-in users view a lot of pages and the survey mostly ends up being displayed on their nth viewed page (where n is more than 1), because their browser is already connected to our domain, whereas for a lot of logged-out users we capture their first page load, with a higher probability of a cold cache. This means that logged-in users, despite having a (potential) latency penalty of connecting to the US, tend to have more cached assets, particularly the JS and CSS needed by the page. This doesn&#039;t fully compensate for the performance penalty of connecting to a potentially distant datacenter, but it might reduce the variability of performance between page loads.</p>
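
<p>As a side note, the connect time mentioned above is derived from Navigation Timing; a value of zero means the browser reused an already-open connection. A minimal sketch of the computation:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Sketch: TCP connect time from Navigation Timing. A value of 0 means the
// browser reused an already-open connection to the domain, a strong hint
// that this is not the first page the user loaded from us in this session.
function getConnectTime() {
    const [nav] = performance.getEntriesByType('navigation');
    const t = nav || performance.timing;
    return t.connectEnd - t.connectStart;
}</pre></div>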

<p>In order to further confirm this theory, in the future we could try to record information about how much of the JS and CSS was already available in the browser cache at the time the page load happened. This is not information we currently collect. Such data would allow us to confirm whether or not satisfaction is correlated with how well cached dependencies are, regardless of the user&#039;s logged-in/logged-out status.</p>

<h2 class="remarkup-header">Brand affinity?</h2>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/rt4m234s7qlaibvwrp4j/PHID-FILE-57fa25fd5b474nnidabs/Wikilove2.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_33"><img src="https://phab.wmfusercontent.org/file/data/4zcgdgs5vxmug4asq7ib/PHID-FILE-yc5bxg6zewhgqc7ytzff/preview-Wikilove2.png" width="220" height="165" alt="Wikilove2.png (600×800 px, 290 KB)" /></a></div></p>

<p>Becoming a Wikipedia contributor - and therefore, logging in - requires a certain affinity to the Wikipedia project. It&#039;s possible, as a result, that logged-in users have a more favourable view of Wikipedia than logged-out users on average. And that positive outlook might influence how they judge the performance of the website.</p>

<p>This is a theory <a href="https://phabricator.wikimedia.org/T224253" class="remarkup-link" rel="noreferrer">we will explore in the future</a> by asking more questions in the micro survey, in order to determine whether or not the user who responds has a positive view of our website in general. This would allow us to quantify how large the effect of brand affinity might be on performance perception.</p></div></content></entry><entry><title>Evaluating Element Timing for Images</title><link href="/phame/live/7/post/157/evaluating_element_timing_for_images/" /><id>https://phabricator.wikimedia.org/phame/post/view/157/</id><author><name>Peter (Peter Hedenskog)</name></author><published>2019-04-10T08:24:30+00:00</published><updated>2019-04-24T11:45:53+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>In the search for a better user experience metric, we have tried out the upcoming <a href="https://github.com/WICG/element-timing" class="remarkup-link remarkup-link-ext" rel="noreferrer">Element Timing for Images API</a> in Chrome.</p>

<h3 class="remarkup-header">Background</h3>

<p>One of the tasks we in the <a href="https://www.mediawiki.org/wiki/Wikimedia_Performance_Team" class="remarkup-link remarkup-link-ext" rel="noreferrer">performance team</a> have been struggling with is finding better metrics that can tell us more about the user experience than the technical metrics we usually get out of browsers.</p>

<p>We started out 2015 trying to find a way to know when images are displayed for the user. We tried out the latest patterns at that moment in <a href="https://phabricator.wikimedia.org/T115600" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_36"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T115600</span></span></a>. We used our <a href="https://github.com/WPO-Foundation/webpagetest" class="remarkup-link remarkup-link-ext" rel="noreferrer">WebPageTest</a> instance to record a video of the browser loading the <a href="https://en.wikipedia.org/wiki/Barack_Obama" class="remarkup-link remarkup-link-ext" rel="noreferrer">Obama</a> page, and followed the state-of-the-art technique of the time, using a User Timing mark meant to fire when the image was displayed.</p>

<p>The results were very disappointing. The mark was at 2.0 seconds, but as you can see in the screenshot, the image was displayed at 4.8 seconds. It was off by 2.8 seconds :(  We ran multiple tests and got the same result every time. The state-of-the-art technique people were talking about was clearly completely wrong. This taught us an important lesson: the reliability of any new RUM metric we decide to collect needs to be verified in synthetic testing, using a video recording of the browser.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/h7q4uupe2qbeypcgvwtj/PHID-FILE-dkdnyjeaskt7jvb22vc5/wpt-chrome.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_34"><img src="https://phab.wmfusercontent.org/file/data/h7q4uupe2qbeypcgvwtj/PHID-FILE-dkdnyjeaskt7jvb22vc5/wpt-chrome.png" width="600" alt="wpt-chrome.png (1×2 px, 575 KB)" /></a></div></p>

<p>The next attempt to measure when images appear was when WebPageTest added support for visual element metrics (meaning analyzing a video and getting metrics for specific elements), but that only helps us with synthetic testing. We also want better metrics collected directly from our users.</p>

<h3 class="remarkup-header">Element timings</h3>

<p><a href="https://phabricator.wikimedia.org/p/Gilles/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_38"><span class="phui-tag-core phui-tag-color-person"><span class="phui-tag-dot phui-tag-color-grey"></span>@Gilles</span></a> has been working on enabling <a href="https://github.com/GoogleChrome/OriginTrials/blob/gh-pages/explainer.md" class="remarkup-link remarkup-link-ext" rel="noreferrer">origin trials for Chrome</a> for us to verify the effectiveness and usefulness of upcoming performance APIs. Recently we enabled the <a href="https://developers.chrome.com/origintrials/#/view_trial/3954160472831295489" class="remarkup-link remarkup-link-ext" rel="noreferrer">Trial for Element Timing for Images</a> on <a href="https://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0" class="remarkup-link remarkup-link-ext" rel="noreferrer">Russian Wikipedia</a>. The goal of this API is to report exactly what we had been looking for: when an image is actually displayed to the user.</p>

<p>Let&#039;s verify the accuracy of this new metric and see if it works better than old approximations marked with user timings.</p>

<h3 class="remarkup-header">Evaluating element timings</h3>

<p>Using <a href="https://github.com/sitespeedio/browsertime" class="remarkup-link remarkup-link-ext" rel="noreferrer">Browsertime</a> we record a video of the screen and run some extra JavaScript to collect the new metric. Then we compare the metric we get from JavaScript with the one we get from the video.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/wyijfr5okp6xbtc76bsv/PHID-FILE-iu5wohvndppjuxa2ywr5/thumbnail-image.jpg" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_35"><img src="https://phab.wmfusercontent.org/file/data/wyijfr5okp6xbtc76bsv/PHID-FILE-iu5wohvndppjuxa2ywr5/thumbnail-image.jpg" width="600" alt="thumbnail-image.jpg (638×1 px, 121 KB)" /></a></div></p>

<p>The first large image in an article is named <strong>thumbnail-high</strong>, so we know which one to use. The following JavaScript snippet is what allows us to get the Element Timing metric just for that element:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span></span><span class="p">(</span><span class="kd">function</span><span class="p">()</span> <span class="p">{</span>
    <span class="kr">const</span> <span class="nx">elements</span> <span class="o">=</span> <span class="nx">performance</span><span class="p">.</span><span class="nx">getEntriesByType</span><span class="p">(</span><span class="s1">&#39;element&#39;</span><span class="p">);</span>
    <span class="k">for</span> <span class="p">(</span><span class="kd">let</span> <span class="nx">element</span> <span class="k">of</span> <span class="nx">elements</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="nx">element</span><span class="p">.</span><span class="nx">name</span> <span class="o">===</span> <span class="s1">&#39;thumbnail-high&#39;</span><span class="p">)</span> <span class="p">{</span>
            <span class="k">return</span> <span class="nx">element</span><span class="p">.</span><span class="nx">startTime</span><span class="p">;</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">})();</span></pre></div>

<p>This is passed to Browsertime, which runs it after the page has loaded. Visual Elements is enabled, which analyses the video and gives us a timing corresponding to when the largest image within the viewport is displayed (which, for most articles, is the thumbnail-high image).</p>

<div class="remarkup-code-block" data-code-lang="console" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="gp">$ docker run --rm -v &quot;$(pwd)&quot;:/browsertime sitespeedio/browsertime:4.6.0 --script thumbnail-high.js https://ru.wikipedia.org/wiki/Древесные_стрижи -n 11 --visualElements</span></pre></div>

<p>This was run on two different connectivity types, 11 times in a row each. We then keep the median for both metrics and get the following:</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><strong>URL</strong></td><td><strong>Connectivity</strong></td><td><strong>Largest Image from video </strong>(ms)</td><td> <strong>Element Timing</strong> (ms)</td></tr>
<tr><td><a href="https://ru.wikipedia.org/wiki/Древесные_стрижи" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://ru.wikipedia.org/wiki/Древесные_стрижи</a></td><td>cable</td><td>1100</td><td>1097 </td></tr>
<tr><td><a href="https://ru.wikipedia.org/wiki/Древесные_стрижи" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://ru.wikipedia.org/wiki/Древесные_стрижи</a></td><td>3g</td><td>1567</td><td>1536</td></tr>
<tr></tr>
</table></div>

<p>The video recording performed by Browsertime is done at 30 frames per second, which means each frame lasts 1000/30 = 33.333ms. This indicates that the differences seen between Element Timing and the video analysis are within one frame. Element Timing might very well be the more accurate one, since it&#039;s not constrained by the video recording&#039;s 30fps cadence.</p>

<p>That looks really promising and very accurate, particularly compared to old workarounds. We tested a couple more URLs that you can see in <a href="https://phabricator.wikimedia.org/T219231" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_37"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T219231</span></span></a> and they showed the same result.</p>

<p>For our content, it looks like the Element Timing API finally provides a way for us to know accurately when images are really displayed to users!</p></div></content></entry><entry><title>Autonomous Systems performance report</title><link href="/phame/live/7/post/154/autonomous_systems_performance_report/" /><id>https://phabricator.wikimedia.org/phame/post/view/154/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2019-03-27T11:07:06+00:00</published><updated>2019-04-02T06:49:04+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Today we&#039;re publishing <a href="https://performance.wikimedia.org/asreport/" class="remarkup-link remarkup-link-ext" rel="noreferrer">our first report</a> of the performance experienced by visitors of Wikimedia websites, focused on the <a href="https://en.wikipedia.org/wiki/Autonomous_system_(Internet)" class="remarkup-link remarkup-link-ext" rel="noreferrer">Autonomous Systems</a> visitors are connecting from.</p>

<p><a href="https://performance.wikimedia.org/asreport/" class="remarkup-link remarkup-link-ext" rel="noreferrer">This report</a> will be updated monthly, with historical data made available. The goal is to watch the evolution of these metrics over time, allowing us to identify improvements and potential pain points.</p>

<p>In order to make a fair assessment of the autonomous systems&#039; performance, real user metrics collected from web browsers are normalised, to avoid differences such as the average device power of a given network&#039;s users skewing the results. For example, an ISP with more expensive data plans might have users with more expensive, better-performing devices on average. This is why we compare data points only for similar effective device CPU power between providers. We also separate the mobile and desktop experiences, because they serve different content, with a notable difference in median page weight, which directly impacts performance metrics. We wouldn&#039;t want the mobile/desktop mix of a given provider to influence the results.</p>
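
<p>To give an idea of what that normalisation means in practice, here is a simplified sketch - not our actual pipeline - of comparing providers only within similar device CPU buckets. All field names are made up for the example:</p>

<div class="remarkup-code-block" data-code-lang="js" data-sigil="remarkup-code-block"><pre class="remarkup-code">// Simplified sketch, not the real report pipeline: group RUM samples by a
// coarse device CPU score and by provider, then compute per-bucket medians,
// so a provider's device mix cannot skew its numbers. Field names are made up.
function medianLoadTimeByProviderAndCpuBucket(samples) {
    const groups = new Map();
    for (const s of samples) {
        const bucket = Math.floor(s.cpuScore / 50); // coarse CPU power bucket
        const key = s.autonomousSystem + '|' + bucket;
        if (!groups.has(key)) {
            groups.set(key, []);
        }
        groups.get(key).push(s.loadEventEnd);
    }
    const medians = {};
    for (const [key, values] of groups) {
        values.sort(function (a, b) { return a - b; });
        medians[key] = values[Math.floor(values.length / 2)];
    }
    return medians;
}</pre></div>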

<p>If you look at the report, you might wonder why some autonomous systems&#039; underlying mobile networks show up under &quot;desktop&quot; and some wired internet providers appear under &quot;mobile&quot;. The explanation is that these internet providers either sell home internet devices that are effectively mobile network modems, resulting in people using their desktop computers (and as a result, the desktop websites) over a mobile network, or have mobile device users automatically connect to the same provider&#039;s WiFi routers when they are in reach of one.</p>

<p>One caveat about this report is that in countries that are physically large, like the United States, the country-wide aggregation in no way reflects the important regional differences there might be for a given network. The main reason why we can&#039;t look at smaller regions is that we simply have no way of knowing where mobile users are connecting from, short of collecting geolocation data. Since we care deeply about our users&#039; privacy and their experience, it doesn&#039;t feel appropriate at this time to ask users for their precise location in order to generate this type of finer-grained data. Such a scheme would also suffer from self-selection bias. There&#039;s already a lot of work to be done with the data aggregated at the national level!</p>

<p>We hope that this public report will help network operators understand their customers&#039; real performance characteristics when it comes to browsing one of the web&#039;s largest websites. We welcome peering requests from networks seeking to improve their connectivity to our datacenters.</p></div></content></entry><entry><title>Debugging production with X-Wikimedia-Debug</title><link href="/phame/live/7/post/143/debugging_production_with_x-wikimedia-debug/" /><id>https://phabricator.wikimedia.org/phame/post/view/143/</id><author><name>kostajh (Kosta Harlan)</name></author><published>2019-02-20T16:15:05+00:00</published><updated>2019-03-29T20:31:42+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>In February 2018, a user reported that <a href="https://phabricator.wikimedia.org/T187861" class="remarkup-link" rel="noreferrer">some topics created by users on Flow discussion boards were not appearing in the Recent Changes feeds</a>, including EventStreams and the IRC-RC feed. Various automated patrol systems rely on EventStreams, so the bug meant a number of edits bypassed those systems on Flow-enabled wikis.</p>

<p>When approaching a bug like this, there are typically three things I do:</p>

<ol class="remarkup-list">
<li class="remarkup-list-item">Determine the steps to reproduce the bug. That was already done by the task author (thank you <a href="https://phabricator.wikimedia.org/p/Rxy/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_41"><span class="phui-tag-core phui-tag-color-person">@Rxy</span></a>!) and then confirmed by other contributors to the task (h/t <a href="https://phabricator.wikimedia.org/p/Krinkle/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_42"><span class="phui-tag-core phui-tag-color-person">@Krinkle</span></a>, <a href="https://phabricator.wikimedia.org/p/Etonkovidova/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_43"><span class="phui-tag-core phui-tag-color-person">@Etonkovidova</span></a>)</li>
<li class="remarkup-list-item">Attempt to reproduce the issue locally and set breakpoints in code to understand why the problem occurs</li>
<li class="remarkup-list-item">Check the production logs to look for any messages related to the bug report</li>
</ol>

<p>Unfortunately the problem was not reproducible in the MediaWiki Vagrant development environment. Nor were there any relevant messages in the logs. Since reproducing the issue locally wasn&#039;t possible, <a href="https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/460928" class="remarkup-link remarkup-link-ext" rel="noreferrer">we merged some diagnostic code</a> but still had nothing. Early on, <a href="https://phabricator.wikimedia.org/p/SBisson/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_44"><span class="phui-tag-core phui-tag-color-person">@SBisson</span></a>  <a href="https://phabricator.wikimedia.org/T187861#4135959" class="remarkup-link" rel="noreferrer">suggested a hypothesis</a> about the code path involved in emitting the event:</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">if ( user is trusted ) 
  return true
else
  let&#039;s load the revision from replica, return true based on the status of the revision
  oh it doesn&#039;t exist (yet), return false</pre></div>

<p>But we could not reproduce this, nor could we identify exactly where this might occur since the code paths for this functionality had many points where execution could stop silently.</p>

<h4 class="remarkup-header">Enter X-Wikimedia-Debug</h4>

<p>One of the useful tools in our stack is the <a href="https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug" class="remarkup-link remarkup-link-ext" rel="noreferrer">X-Wikimedia-Debug header</a>. I knew about this header (and its <a href="https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions" class="remarkup-link remarkup-link-ext" rel="noreferrer">browser extensions</a>) from verifying changes that were being <a href="https://wikitech.wikimedia.org/wiki/SWAT_deploys" class="remarkup-link remarkup-link-ext" rel="noreferrer">SWAT&#039;ed into production</a> but I had not thought to use it for tracking down a production bug.</p>

<p>I was using the browser extension with the &quot;Log&quot; checkbox ticked (and still not finding anything useful in Logstash to help isolate this bug) when I realized that I could also profile the problematic request. When you check the box to <a href="https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Request_profiling" class="remarkup-link remarkup-link-ext" rel="noreferrer">profile a request</a>, XHProf will profile the code that&#039;s executed and make the result available for viewing via <a href="https://performance.wikimedia.org/xhgui/" class="remarkup-link remarkup-link-ext" rel="noreferrer">XHGui</a>.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/eg4kieypr7tyyjag36fc/PHID-FILE-m3p7wqwuuf4vemmdxx3u/profile-log.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_39"><img src="https://phab.wmfusercontent.org/file/data/eg4kieypr7tyyjag36fc/PHID-FILE-m3p7wqwuuf4vemmdxx3u/profile-log.png" height="174" alt="profile-log.png (348×1 px, 78 KB)" /></a></div></p>

<p>Typically you do this to understand performance bottlenecks in your code, as you get a complete list of all functions executed during the request, along with the time and memory usage associated with each function.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/hnrrgafhjfewmrlk4wms/PHID-FILE-c4vgueq2eh5dnhl4edpy/func-list.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_40"><img src="https://phab.wmfusercontent.org/file/data/hnrrgafhjfewmrlk4wms/PHID-FILE-c4vgueq2eh5dnhl4edpy/func-list.png" height="200" alt="func-list.png (1×2 px, 589 KB)" /></a></div></p>

<p>I followed the steps to reproduce and then switched on the &quot;Profile&quot; option before posting a new topic on an empty Flow board. Now, I had a profiled request which provided me with information on all the methods called, including which method called another (click on a method call to see its parent and children method calls). From here I could follow the path traversed by Flow&#039;s event emitting code, and see exactly where the code execution halted.</p>

<h4 class="remarkup-header">Reproducing the bug locally</h4>

<p>With this knowledge, I went back to my local environment, this time using <a href="https://www.mediawiki.org/wiki/MediaWiki-Docker-Dev" class="remarkup-link remarkup-link-ext" rel="noreferrer">MediaWiki-Docker-Dev</a>, which has database replication set up as part of its stack (MediaWiki Vagrant does not). I set some breakpoints in the code I suspected was causing the problem, and then found that in RevisionActionPermissions.php#isBoardAllowed(), we had this code:</p>

<div class="remarkup-code-block" data-code-lang="php" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="nv">$allowed</span> <span class="o">=</span> <span class="nv">$this</span><span class="o">-&gt;</span><span class="na" data-symbol-name="user">user</span><span class="o">-&gt;</span><span class="na" data-symbol-name="isAllowedAny">isAllowedAny</span><span class="o">(</span> <span class="o">...(</span><span class="k">array</span><span class="o">)</span><span class="nv">$permissions</span> <span class="o">);</span>
<span class="k">if</span> <span class="o">(</span> <span class="nv">$allowed</span> <span class="o">)</span> <span class="o">{</span>
 <span class="no"> </span> <span class="no"> return</span> <span class="kc">true</span><span class="o">;</span>
<span class="o">}</span>
<span class="k">return</span> <span class="o">!</span><span class="nv">$workflow</span><span class="o">-&gt;</span><span class="na" data-symbol-name="isDeleted">isDeleted</span><span class="o">();</span></pre></div>

<p>For a new topic on a blank Flow board, <tt class="remarkup-monospaced">$permissions</tt> is <tt class="remarkup-monospaced">deletedtext</tt>, which would return true for privileged users. But for unprivileged users, Flow would check <tt class="remarkup-monospaced">!$workflow-&gt;isDeleted();</tt>, and this evaluated as false because the code was querying the database replica, and the title did not exist there yet.</p>

<p>The <a href="https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/469623" class="remarkup-link remarkup-link-ext" rel="noreferrer">submitted solution</a> was to patch <tt class="remarkup-monospaced">isDeleted()</tt> to query the master DB when in the context of a POST request, since we know the title would exist in the master DB. With this patch in place, events were once again emitted properly and the bug was fixed.</p>

<h4 class="remarkup-header">Conclusion</h4>

<p>A few of my conclusions from this experience:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">If you&#039;re having difficulty tracking down the code path, consider using the profiler in the X-Wikimedia-Debug browser extension</li>
<li class="remarkup-list-item">Diagnostic code is helpful (even if it didn&#039;t pinpoint the problem here) and debug level logging should be considered instead of silent returns</li>
<li class="remarkup-list-item">Having database replication in your local development environment can help catch issues while developing and when attempting to reproduce a production issue. One can use the MediaWiki-Docker-Dev environment for this, and see also <a href="https://www.mediawiki.org/wiki/MediaWiki-Docker-Dev" class="remarkup-link remarkup-link-ext" rel="noreferrer">how to adjust its database replication lag</a>.</li>
</ul>

<p>Kosta Harlan<br />
Senior Software Engineer<br />
<a href="https://www.mediawiki.org/wiki/Growth" class="remarkup-link remarkup-link-ext" rel="noreferrer">Growth Team</a></p>

<hr class="remarkup-hr" />

<p><em>Learn more about the <a href="https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug" class="remarkup-link remarkup-link-ext" rel="noreferrer">X-Wikimedia-Debug header</a> and <a href="https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Browser_extensions" class="remarkup-link remarkup-link-ext" rel="noreferrer">browser extension</a> on Wikitech.</em></p></div></content></entry><entry><title>Magic Numbers</title><link href="/phame/live/7/post/142/magic_numbers/" /><id>https://phabricator.wikimedia.org/phame/post/view/142/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2019-01-24T10:40:32+00:00</published><updated>2019-02-06T22:11:20+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p><em><a href="https://calendar.perfplanet.com/2018/magic-numbers/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Previously posted</a> on the <a href="https://calendar.perfplanet.com/2018/" class="remarkup-link remarkup-link-ext" rel="noreferrer">2018 Performance Calendar</a></em></p>

<p>Guidelines like <a href="https://developers.google.com/web/fundamentals/performance/rail#ux" class="remarkup-link remarkup-link-ext" rel="noreferrer">RAIL</a> are popular in the web performance community. They often define time limits that must be respected, like 100ms for what feels instantaneous, or 1000ms for the limit of acceptable response time.</p>

<p>Prominent people in the performance community keep telling us that <a href="https://youtu.be/XvZ7-Uh0R4Q?t=1350" class="remarkup-link remarkup-link-ext" rel="noreferrer">there&#039;s a lot of science behind those numbers</a>.</p>

<p>I&#039;ve always been skeptical of that claim, and earlier last year I set out to find out if there&#039;s any merit to those numbers by doing <a href="https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Perceived_Performance" class="remarkup-link remarkup-link-ext" rel="noreferrer">an extensive literature
review of web performance perception academic research</a>. Here are some of the findings from that project.</p>

<h2 class="remarkup-header">Following the citation trail</h2>

<p>If you follow paper citations, some classic papers keep showing up as references. And in the world of web performance, two papers get cited a lot more than any other.</p>

<p><a href="http://yusufarslan.net/sites/yusufarslan.net/files/upload/content/Miller1968.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">Response Time in Man-Computer Conversational Transactions</a> by Miller, 1968 and <a href="https://www.nngroup.com/articles/response-times-3-important-limits/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Response Times: The 3 Important Limits</a> by Nielsen, 1993/2014.</p>

<p>Nielsen essentially takes some of the numbers from the Miller paper, brushes the dust off of them since they were pre-web, and presents them in a simpler fashion that everyone understands, stating that they apply to the web. What Nielsen doesn&#039;t do, however, is prove that those numbers are true with research of any kind. Jakob Nielsen is simply stating these limits as facts, but no <em>science</em> has been done to prove that they are true. And ever since, the entire web community has believed what a self-proclaimed expert said on the matter and turned it into guidelines. Surely, if an authoritative-looking man with glasses who holds a PhD in HCI states something very insistently, it must be true.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/xcdaaw7hhktmg6snovig/PHID-FILE-44alr6dd3ujcprc44p7w/640px-Jakob_Nielsen_1.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_45"><img src="https://phab.wmfusercontent.org/file/data/xcdaaw7hhktmg6snovig/PHID-FILE-44alr6dd3ujcprc44p7w/640px-Jakob_Nielsen_1.jpg" height="480" width="640" loading="lazy" alt="640px-Jakob_Nielsen_1.jpg (480×640 px, 66 KB)" /></a></div></td></tr>
<tr><td><strong>Trust me, I know things! </strong></td></tr>
<tr></tr>
</table></div>

<p>What about the Miller paper? After all, if Nielsen insists that those principles are an absolute truth that hasn&#039;t changed in 50 years, maybe it&#039;s because Miller&#039;s research was so compelling to start with? I think everyone who believes that the numbers found in RAIL and similar guidelines are real should <a href="http://yusufarslan.net/sites/yusufarslan.net/files/upload/content/Miller1968.pdf" class="remarkup-link remarkup-link-ext" rel="noreferrer">read the Miller paper</a>, the origin of these pervasive magic numbers. Not only does Miller not back up any of the stated magic numbers with research of any kind - it&#039;s really just a giant subjective essay - it also contains gems that Nielsen didn&#039;t seem to find useful to include in his cleaned-up version of it:</p>

<blockquote><p>If he has made an error that the system can detect, he should be allowed to complete his segment of thought before he is interrupted or told he is locked out. After two seconds and before four seconds following completion of keying in his &quot;thought&quot; he should be informed of his error and either &quot;told&quot; to try again, or told of the error he made.<br />
Comment: It is rude (i.e., disturbing) to be interrupted in mid-thought. The annoyance of the interruption makes it more difficult to get back to the train of thought. The two-second pause enables the user to get his sense of completion following which an error indication is more acceptable.</p></blockquote>

<p>Miller advocates intentionally delaying error messages by a whole 2 seconds, in order to avoid disturbing the user&#039;s train of thought. If it sounds silly and dated, it&#039;s because it is, just like the rest of Miller&#039;s paper. Like Nielsen&#039;s, it means well, but pulls magic numbers out of thin air. Not a single experiment was conducted, not a single human being studied or surveyed in the making of these magic numbers. There is no research data to verify the claims.</p>

<h2 class="remarkup-header">What happens when you do real science</h2>

<p><em>Are 100 ms Fast Enough? Characterizing Latency Perception Thresholds in Mouse-Based Interaction</em> by Forch, Franke, Rauh and Krems (2017) looked into one of the most popular magic numbers from the Miller/Nielsen playbook: 100ms as the threshold for what feels instantaneous. Here&#039;s the key result of that study:</p>

<blockquote><p>The latency perception thresholds’ range was 34–137 ms with a mean of 65 ms (Median = 54 ms) and a standard deviation of 30 ms.</p></blockquote>

<p>This is quite different from the 100ms universal threshold we keep hearing about. The study goes on to show that subjects with a habit of playing action video games tend to have a lower threshold than others, showing that cultural differences can affect that limit.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/vgr2frve4n6o6s7cz2ya/PHID-FILE-kgk4c2fbfq6zomyl73yo/402px-Zan_Zig_performing_with_rabbit_and_roses%2C_magician_poster%2C_1899.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_46"><img src="https://phab.wmfusercontent.org/file/data/vgr2frve4n6o6s7cz2ya/PHID-FILE-kgk4c2fbfq6zomyl73yo/402px-Zan_Zig_performing_with_rabbit_and_roses%2C_magician_poster%2C_1899.jpg" height="600" width="402" loading="lazy" alt="402px-Zan_Zig_performing_with_rabbit_and_roses,_magician_poster,_1899.jpg (600×402 px, 58 KB)" /></a></div></td></tr>
<tr><td><strong>Googler revealing the next iteration of RAIL guidelines</strong></td></tr>
<tr></tr>
</table></div>

<p>When you think about it, it does make sense that the real threshold is a range that depends on demographics, and that there&#039;s no reason there should be a universal threshold that happens to be a round number. It would be all too magical, wouldn&#039;t it?</p>

<h2 class="remarkup-header">Proving universal facts about mankind based on students down the hall</h2>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/lzs3s6vcro7phkgtaci4/PHID-FILE-ls4nwkge4hfjsdfl6dk2/Student_in_Class_%283618969705%29.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_47"><img src="https://phab.wmfusercontent.org/file/data/lzs3s6vcro7phkgtaci4/PHID-FILE-ls4nwkge4hfjsdfl6dk2/Student_in_Class_%283618969705%29.jpg" height="417" width="640" loading="lazy" alt="Student_in_Class_(3618969705).jpg (417×640 px, 81 KB)" /></a></div></td></tr>
<tr><td><strong>Can you spot the person younger than 19 or older than 36?</strong></td></tr>
<tr></tr>
</table></div>

<p>A major weakness in a lot of the papers doing real science I&#039;ve reviewed, however, is that when actual research on people is done, it&#039;s usually on a group that lacks diversity. It&#039;s often whoever scientists have easy access to - typically students from the same university. They&#039;re subjects that are educated, proficient with technology, and often have a monetary incentive to participate, which obviously skews the results. And yet, after performing a study on a dozen paid students, these research papers will often claim to have proven a universal truth about all human beings. This is actually true of the study I quoted earlier about the 100ms threshold, with the minor difference of students earning course credits rather than money. Here&#039;s their description of study participants:</p>

<blockquote><p>Twenty students (10 female, age 19–36 years, M = 23.45, SD = 3.32) which were recruited via the local psychology student mailing list took part in the experiment. All participants had normal or corrected-to-normal vision and normally used their right hand for handling computer mice. Participants signed an informed consent sheet at the beginning of the experiment and received partial course credit for participation.</p></blockquote>

<p>Another very common weakness of the studies I&#039;ve found is that they&#039;re often performed in labs using fake browsers, predetermined browsing scenarios, or by having people watch videos of page loads - all of which are very disconnected from the real experience of browsing the web. Overall, we should remain skeptical of studies&#039; results when their experimental setup was questionable in those ways. While the 100ms study disproved the 100ms universality myth with just 20 people, it remains insufficient to prove that the different numbers that emerged are any more universal.</p>

<h2 class="remarkup-header">Everything sucks, now what?</h2>

<p>Beyond magic numbers, <a href="https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Perceived_Performance" class="remarkup-link remarkup-link-ext" rel="noreferrer">my literature review</a> revealed that very little real science has been done about web performance perception in general.</p>

<p>It is disappointing to find out that we don&#039;t know much about web performance from a scientific perspective. <a href="https://wpostats.com/" class="remarkup-link remarkup-link-ext" rel="noreferrer">WPO stats</a> might contain a lot of compelling-looking case studies, but the detailed data behind those is rarely, if ever, shared. And they&#039;re usually about how performance improvements may drive sales, without answering fundamental questions about whether things feel fast to users. Additionally, when performance improvements don&#039;t result in sales or traffic increases, they don&#039;t become a case study or something people announce proudly, which results in a self-selection bias in industry stories of that nature.</p>

<p>My reaction to these disappointing findings from the literature review was to start working on original research of my own, on real Wikipedia users, as part of my work as a member of the <a href="https://www.mediawiki.org/wiki/Wikimedia_Performance_Team" class="remarkup-link remarkup-link-ext" rel="noreferrer">Wikimedia Performance Team</a>, the first results of which will be published early next year. I encourage the web performance community to do the same. The lack of science is a solvable problem: anyone can do original research and publish the data alongside the findings, so we can all make progress together on understanding how people truly perceive performance. And maybe we&#039;ll be able to come up with new guidelines based on numbers backed by science.</p>

<p><em>Photo credit: Doc Searls, Tulane Public Relations CC-BY-SA 2.0</em></p></div></content></entry><entry><title>Why performance matters</title><link href="/phame/live/7/post/131/why_performance_matters/" /><id>https://phabricator.wikimedia.org/phame/post/view/131/</id><author><name>Imarlier (Ian Marlier)</name></author><published>2018-12-12T16:21:17+00:00</published><updated>2019-01-13T01:19:13+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>There are practical reasons that web performance matters.  From a user perspective, a site that’s slow results in frustration, annoyance, and ultimately a preference for alternatives.  From the perspective of a site operator, frustrated users are users who aren’t going to return, and that makes it more difficult to accomplish your mission (be it commercial or public service).  Optimizations keep people happy, keep them coming back, and keep them engaged[1].</p>

<p>But, there’s a far more important reason to care about performance, especially for an organization like Wikimedia: improving performance is an essential step toward equity of access.</p>

<p>There are a multitude of factors that influence how quickly a web site loads.  Many of these are universal to every user: the software itself, the operational environment in which that software runs, the network that carries the bits from the server.  Improvement in any of these areas benefits every consumer of the site.</p>

<p>This doesn’t account for the large number of factors that are user specific.  Among the factors that can significantly influence how quickly a web page loads for a given user are geography (a user who lives further away from the servers that host a web site will typically have slower access than a user who is closer); the network between the server and the user (a network that is less developed may be slower, or more susceptible to congestion); the user’s connection (mobile data is slower than wired broadband in most cases); and the user’s actual device (an old computer will load pages more slowly than a new one).</p>

<p>The common thread between these factors is that they correlate to socioeconomic and social factors, rather than technical ones.  Wealthier people, in more developed countries, have a significantly easier time accessing the vast resources of the Internet than others.  If an increasingly networked world is going to result in a more equal human society, we need to make thoughtful interventions, including interventions focused on performance.</p>

<p><strong>Geography</strong><br />
The correspondence of geography to socioeconomic factors manifests primarily in where servers are located.  Data centers, by and large, are located in wealthier parts of wealthier countries -- places where physical and network security guarantees are high, infrastructure is reliable, and trained staff are easy to hire.  This is a sensible decision by those who build and operate these facilities, but it has the unintended consequence of slowing web performance for anyone who isn’t located in a wealthier part of a wealthy country.</p>

<p><strong>Backbone Networks</strong><br />
<a href="https://en.wikipedia.org/wiki/Internet_backbone" class="remarkup-link remarkup-link-ext" rel="noreferrer">Backbone networks</a> are the networks that carry traffic from servers to end users -- the highways that collectively make up the “information superhighway”.  And like highways, not all are equal.  Massive cables <a href="https://blogs.voanews.com/all-about-america/2015/09/30/detailed-map-reveals-hidden-backbone-of-us-internet/" class="remarkup-link remarkup-link-ext" rel="noreferrer">connect cities like San Francisco, Seattle, and New York</a>; many other cities, even ones that are quite large, are served by second or third order spurs off of these primary lines.  Dozens of cables <a href="https://www.submarinecablemap.com/" class="remarkup-link remarkup-link-ext" rel="noreferrer">traverse the North Atlantic and North Pacific</a>; only a small handful cross any oceans South of the equator.  Interior network maps are hard to come by, but in most of the world we know that smaller towns and sometimes even smaller cities are simply not connected to the Internet at all.</p>

<p><strong>Last-mile connectivity</strong><br />
Last-mile connectivity is how engineers refer to the way that your computer or smartphone connects to the network.  Cable internet is one form of last-mile connectivity; so is 4G cellular, or DSL.  In most of the world, the last mile is the biggest bottleneck in network traffic.  It’s more likely than not that the last mile is the slowest part of the entire journey from the server to your computer, regardless of where you are in the world.</p>

<p>However, depending on where in the world you are, “slowest” can have very different meanings.  <a href="https://en.wikipedia.org/wiki/List_of_countries_by_number_of_broadband_Internet_subscriptions" class="remarkup-link remarkup-link-ext" rel="noreferrer">In many countries, only a tiny fraction of the population has any access to high-speed internet</a>, whether wired or wireless.  Less than 1% in Ethiopia; about 2.5% of the population in Nicaragua; 15% in Libya.  Even in India, considered by many to be a key cog in the modern Internet economy, less than 25% of the population has high speed data access.  Meanwhile, in Japan, the average individual has <em><strong>2</strong></em> broadband subscriptions.  In much of Western Europe, too, the rate of broadband penetration approaches or exceeds 100%.</p>

<p><strong>Device quality</strong><br />
The final factor that corresponds with development and socioeconomic status is device quality.  Stated simply, computers are expensive, whether those computers are placed on a desk or carried in a pocket.  Recent trends in software development have pushed more computation down the wire to the client.  This, in turn, means that <a href="https://phabricator.wikimedia.org/phame/live/7/post/109/mobile_web_performance_the_importance_of_the_device/" class="remarkup-link" rel="noreferrer">the performance difference for a site when run on a high-end versus a low-end device can be quite significant</a>, and in some cases it’s not even possible to access sites on devices that are underpowered[2].</p>

<hr class="remarkup-hr" />

<p>Though there is no single change that we can make that will address all of these factors, addressing each of them is core to serving the mission of the Wikimedia Foundation, and of the Wikimedia movement as a whole.</p>

<p>One ongoing element of this work is research to understand the actual factors that influence user perception of performance, and the way that user satisfaction is impacted when a page loads slowly.  This allows us to make data-driven decisions about where to spend our time and our energy.</p>

<p>We’ve shown that <a href="https://blog.wikimedia.org/2018/04/24/new-data-center-singapore/" class="remarkup-link remarkup-link-ext" rel="noreferrer">expanding our cache footprint can help to minimize the effects of geography</a>.  This gives us a way to address the imbalances that result from immutable physics.</p>

<p>We’re not in a position to address inequality of backbone or last-mile network infrastructure -- that’s something best left to telecom companies, governments, or non-profit organizations that have chosen that as their work.  What we can do is to minimize the effects of these disparities by reducing the number of bytes that need to go down the wire in order to display a page, by exploring technologies like peer-to-peer distribution to eliminate them altogether, or by increasing usage of offline content that can be downloaded in bulk using public high-speed connections.</p>

<p>Finally, we can aggressively work to lower the compute cost of each page that we serve, so that the cost or the age of a user’s device doesn’t impact their ability to read, learn, and contribute to the world of free knowledge.</p>

<p>Performance engineering matters, in other words, because it gives us a way to eliminate technological divides that are otherwise difficult, expensive, or even impossible to address at a systemic level.</p>

<hr class="remarkup-hr" />

<p>[1] <a href="http://engineroom.ft.com/2016/04/04/a-faster-ft-com/" class="remarkup-link remarkup-link-ext" rel="noreferrer">http://engineroom.ft.com/2016/04/04/a-faster-ft-com/</a> is a great breakdown of the implications of performance on content consumption, based on the experience of the Financial Times as they were developing a new website.  <a href="https://medium.com/@vikigreen/impact-of-slow-page-load-time-on-website-performance-40d5c9ce568a" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://medium.com/@vikigreen/impact-of-slow-page-load-time-on-website-performance-40d5c9ce568a</a> aggregates a number of different studies that illustrate the financial implications of slow page-load performance for commercial websites.</p>

<p>[2] A number of years ago, Chris Zacharias, formerly an engineer at Youtube, published <a href="http://blog.chriszacharias.com/page-weight-matters" class="remarkup-link remarkup-link-ext" rel="noreferrer">an anecdote about the creation of a very lightweight video display page</a>.  When they launched it to a subset of traffic, the result was that measured page performance <em>got worse</em>, a surprising result when the page was significantly smaller.  In the end it turned out that this happened because it was suddenly possible to load the player on low-powered devices and in less-connected geographies -- previously those data hadn’t been included at all because Youtube was entirely inaccessible at any speed.</p></div></content></entry><entry><title>Perf Matters at Wikipedia in 2015</title><link href="/phame/live/7/post/124/perf_matters_at_wikipedia_in_2015/" /><id>https://phabricator.wikimedia.org/phame/post/view/124/</id><author><name>Krinkle (Timo Tijhof)</name></author><published>2019-02-14T16:05:00+00:00</published><updated>2020-12-31T00:33:26+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><h3 class="remarkup-header">Hello, WANObjectCache</h3>

<p>This year we achieved another milestone in our multi-year effort to prepare Wikipedia for serving traffic from <a href="https://www.mediawiki.org/wiki/Requests_for_comment/Master-slave_datacenter_strategy_for_MediaWiki" class="remarkup-link remarkup-link-ext" rel="noreferrer">multiple data centres</a>.</p>

<p>The MediaWiki application that powers Wikipedia relies heavily on object caching. We use Memcached as a horizontally scaled key-value store, and we’d like to keep the cache local to each data centre. This minimises dependencies between data centres, and makes better use of storage capacity (based on local needs).</p>

<p>Aaron Schulz devised a strategy that makes MediaWiki caching compatible with the requirements of a multi-DC architecture. Previously, when source data changed, MediaWiki would recompute and replace the cache value. Now, MediaWiki <a href="https://github.com/wikimedia/mediawiki/blob/1.32.0/includes/libs/objectcache/WANObjectCache.php#L583-L614" class="remarkup-link remarkup-link-ext" rel="noreferrer">broadcasts “purge” events</a> for cache keys. Each data centre receives these and sets a “tombstone”, a marker lasting a few seconds that limits any set-value operations for that key to a miniscule time-to-live. This makes it tolerable for recache-on-miss logic to recompute the cache value using local replica databases, even though they might have several seconds of replication lag. Heartbeats are used to detect the replication lag of the databases involved during any re-computation of a cache value. When that lag is more than a few seconds (a large portion of the tombstone period), the corresponding cache set-value operation automatically uses a low time-to-live. This means that large amounts of replication lag are tolerated.</p>

<p>This and other aspects of <a href="https://doc.wikimedia.org/mediawiki-core/1.30.0/php/classWANObjectCache.html#details" class="remarkup-link remarkup-link-ext" rel="noreferrer">WANObjectCache’s design</a> allow MediaWiki to trust that cached values are not substantially more stale than a local replica database, provided that cross-DC broadcasting of the tiny in-memory tombstones is not disrupted.</p>
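
<p>As a rough illustration of the tombstone mechanism, here is a minimal Python sketch. The names, TTL values, and structure are invented for the example; the real logic lives in WANObjectCache (PHP) and is considerably more involved.</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">import time

TOMBSTONE_TTL = 10   # seconds a purge marker lingers (illustrative value)
LOW_TTL = 1          # TTL forced on writes that may be based on stale data
NORMAL_TTL = 86400

class PurgeAwareCache:
    """Toy model of the purge/tombstone idea; not the WANObjectCache API."""

    def __init__(self):
        self.values = {}      # key -&gt; (value, expiry timestamp)
        self.tombstones = {}  # key -&gt; time the purge event was received

    def purge(self, key):
        # In production this event is broadcast to every data centre.
        self.values.pop(key, None)
        self.tombstones[key] = time.time()

    def set(self, key, value, replica_lag=0.0):
        ttl = NORMAL_TTL
        purged_recently = time.time() - self.tombstones.get(key, 0.0) &lt; TOMBSTONE_TTL
        if purged_recently or replica_lag &gt; TOMBSTONE_TTL / 2:
            # The value may have been computed from lagged replicas:
            # keep it only briefly so a fresh recompute happens soon.
            ttl = LOW_TTL
        self.values[key] = (value, time.time() + ttl)</pre></div>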

<hr class="remarkup-hr" />

<h3 class="remarkup-header">First paint time now under 900ms</h3>

<p>In July we set out a goal: improve page load performance so our median first paint time would go down from approximately 1.5 seconds to under a second – and stay under it!</p>

<p><div class="phabricator-remarkup-embed-layout-right phabricator-remarkup-embed-float-right"><a href="https://phab.wmfusercontent.org/file/data/aubuhmyuf36jr5feucds/PHID-FILE-3grypqxwgck7qxhia5fx/firstpaint_graphite.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_48"><img src="https://phab.wmfusercontent.org/file/data/aubuhmyuf36jr5feucds/PHID-FILE-3grypqxwgck7qxhia5fx/firstpaint_graphite.png" height="180" alt="firstpaint_graphite.png (556×800 px, 45 KB)" /></a></div> I identified synchronous scripts as the single-biggest task blocking the browser, between the start of a page navigation and the first visual change seen by Wikipedia readers. We had used async scripts before, but <a href="https://phabricator.wikimedia.org/T107399" class="remarkup-link" rel="noreferrer">converting these last two scripts</a> to be asynchronous was easier said than done.</p>

<p>There were <a href="https://phabricator.wikimedia.org/T107399" class="remarkup-link" rel="noreferrer">several blockers</a> to this change, including the use of embedded scripts by interactive features. These were partly migrated to CSS-only solutions. For the other features, we introduced the notion of “delayed inline scripts”. Embedded scripts now <a href="https://github.com/wikimedia/mediawiki/blob/1.31.0/includes/resourceloader/ResourceLoader.php#L1494-L1514" class="remarkup-link remarkup-link-ext" rel="noreferrer">wrap their code</a> in a closure and add it to an array. After the module loader arrives, we <a href="https://github.com/wikimedia/mediawiki/blob/1.31.0/resources/src/startup.js#L136-L144" class="remarkup-link remarkup-link-ext" rel="noreferrer">process the closures</a> from the array and execute the code within.</p>
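
<p>The real code lives in ResourceLoader and the startup module (linked above); the queue-then-flush pattern itself is simple. A conceptual Python sketch, with invented callback bodies, might look like this:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Embedded snippets no longer run immediately; they are wrapped in a
# closure and queued, and the queue is flushed once the loader is ready.
pending = []

def queue(closure):
    """What an embedded script now does: push its work instead of running it."""
    pending.append(closure)

def on_loader_ready():
    """What the startup code does once the module loader is available."""
    while pending:
        pending.pop(0)()

queue(lambda: print("set up collapsible tables"))
queue(lambda: print("initialise the interactive map"))
on_loader_ready()</pre></div>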

<p>Another major blocker was the subset of community-developed gadgets that didn’t yet use the module loader (introduced in 2011). These legacy scripts assumed a global scope for variables, and depended on browser behaviour specific to serially loaded, synchronous scripts. Between July and August 2015, I worked with the community to develop a migration guide. And, after a short deprecation period, the legacy loader was removed.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/m6hllri46kc7s52hh4t2/PHID-FILE-by3iiutxbxn3k3jca522/firstpaint_coalweb.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_49"><img src="https://phab.wmfusercontent.org/file/data/m6hllri46kc7s52hh4t2/PHID-FILE-by3iiutxbxn3k3jca522/firstpaint_coalweb.png" height="211" alt="Line graph that plots the firstPaint metric for August 2015. The line drops from approximately one and a half seconds to 890 milliseconds." /></a></div></p>

<hr class="remarkup-hr" />

<h3 class="remarkup-header">Hello, WebPageTest</h3>

<p>Previously, we only collected performance metrics for Wikipedia from sampled real-user page loads. This is super and helps detect trends, regressions, and other changes at large. But, to truly understand the characteristics of what made a page load a certain way, we need synthetic testing as well.</p>

<p>Synthetic testing offers frame-by-frame video captures, waterfall graphs, performance timelines, and above-the-fold visual progression. We can run these automatically (e.g. every hour) for many URLs, on many different browsers and devices, and from different geographic locations. These tests allow us to <a href="https://phabricator.wikimedia.org/T109666" class="remarkup-link" rel="noreferrer">understand the performance</a> and analyse it in depth. We can then compare runs over any period of time, and across different factors. It also gives us snapshots of how pages were built at a certain point in time.</p>

<p>The results are automatically recorded into a database every hour, and we use Grafana to <a href="https://grafana.wikimedia.org/d/000000210/webpagetest?orgId=1" class="remarkup-link remarkup-link-ext" rel="noreferrer">visualise the data</a>.</p>

<p>In 2015 Peter built out the synthetic testing infrastructure for Wikimedia, from scratch. We use the open-source WebPageTest software. To read more about its operation, <a href="https://wikitech.wikimedia.org/wiki/Performance/WebPageTest" class="remarkup-link remarkup-link-ext" rel="noreferrer">check Wikitech</a>.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/x4afkxbbf27r3xrtwt6a/PHID-FILE-nsod5j3be2d2hupk3esu/webpagetest.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_50"><img src="https://phab.wmfusercontent.org/file/data/x4afkxbbf27r3xrtwt6a/PHID-FILE-nsod5j3be2d2hupk3esu/webpagetest.png" height="269" alt="webpagetest.png (1×2 px, 219 KB)" /></a></div></p>

<hr class="remarkup-hr" />

<h3 class="remarkup-header">The journey to Thumbor begins</h3>

<p>Gilles <a href="https://phabricator.wikimedia.org/T110858" class="remarkup-link" rel="noreferrer">evaluated various thumbnailing services</a> for MediaWiki. The open-source Thumbor software came out as the most promising candidate.</p>

<p>Gilles implemented support for Thumbor in the MediaWiki-Vagrant development environment.</p>

<p>To read more about our journey to Thumbor, read <a href="https://phabricator.wikimedia.org/phame/live/7/post/55/the_journey_to_thumbor_part_1_rationale/" class="remarkup-link" rel="noreferrer">The Journey to Thumbor (part 1)</a>.</p>

<hr class="remarkup-hr" />

<h3 class="remarkup-header">Save timing reduced by 50%</h3>

<p>Save timing is one of the key performance metrics for Wikipedia. It measures the time from when a user presses “Publish changes” while editing until the user’s browser starts to receive a response. During this time, many things happen. MediaWiki parses the wiki-markup into HTML, which can involve page macros, sub-queries, templates, and other parser extensions. These inputs must be saved to a database. There may also be some cascading updates, such as the page’s membership in a category. And last but not least, there is the network latency between the user’s device and our data centres.</p>

<p>This year saw a 50% reduction in save timing. At the beginning of the year, median save timing was 2.0 seconds (<a href="https://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Foundation_Quarterly_Report,_FY_2014-15_Q3_(January-March).pdf&amp;page=3" class="remarkup-link remarkup-link-ext" rel="noreferrer">quarterly report</a>). By June, it was down to 1.6 seconds (<a href="https://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Foundation_Quarterly_Report,_FY_2014-15_Q4_(April-June).pdf&amp;page=3" class="remarkup-link remarkup-link-ext" rel="noreferrer">report</a>), and in September 2015, we reached 1.0 seconds! (<a href="https://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Foundation_Quarterly_Report,_FY_2015-16_Q1_(July-September).pdf&amp;page=3" class="remarkup-link remarkup-link-ext" rel="noreferrer">report</a>)</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/xzmcd2m6kfhqarahqmaj/PHID-FILE-6nnof7qhci64wqnltzy4/savetiming_2s-1_6s-1s_from_May_to_June.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_51"><img src="https://phab.wmfusercontent.org/file/data/xzmcd2m6kfhqarahqmaj/PHID-FILE-6nnof7qhci64wqnltzy4/savetiming_2s-1_6s-1s_from_May_to_June.png" height="278" alt="Line graph of the median save timing metric, over 2015. Showing a drop from two seconds to one and a half in May, and another drop in June, gradually going further down to one second." /></a></div></p>

<p>The effort to reduce save timing was led by Aaron Schulz. The impact that followed was the result of hundreds of changes <a href="https://github.com/wikimedia/mediawiki/graphs/contributors?from=2015-01-01&amp;to=2016-01-01&amp;type=c" class="remarkup-link remarkup-link-ext" rel="noreferrer">to MediaWiki core</a> and <a href="https://gerrit.wikimedia.org/r/#/q/owner:%22Aaron+Schulz+%253Caschulz%2540wikimedia.org%253E%22+branch:master+before:2016-01-02+after:2014-12-30+projects:mediawiki" class="remarkup-link remarkup-link-ext" rel="noreferrer">to extensions</a>.</p>

<h4 class="remarkup-header">Deferring tasks to post-send</h4>

<p>Many of these changes involved deferring work to happen post-send, that is, after the server has sent the HTTP response to the user and closed the main database transaction. Examples of tasks that now happen post-send are: cascading updates, emitting “recent changes” objects to the database and to pub-sub feeds, and doing automatic user rights promotions for the editing user based on their account age and total edit count.</p>

<p>Aaron also implemented the “async write” feature in the multi-backend object cache interface. MediaWiki uses this for storing the parser cache HTML in both Memcached (tier 1) and MySQL (tier 2). The second write now happens post-send.</p>

<p>By re-ordering these tasks to occur post-send, the server can send a response back to the user sooner.</p>
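
<p>The underlying idea is simple to sketch. The following toy Python example (function names and tasks are invented; MediaWiki’s actual deferred-update machinery is PHP and far more robust) shows work being queued during a request and executed only after the response has been handed to the client:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Minimal sketch of "defer to post-send": queue non-essential work during the
# request and run it only after the response has been sent to the client.
post_send_queue = []

def defer_post_send(task):
    """Queue a callable to run after the response is sent."""
    post_send_queue.append(task)

def handle_edit_request():
    # Pretend we parsed the edit and committed the main database transaction.
    defer_post_send(lambda: print("update category membership"))
    defer_post_send(lambda: print("write parser cache copy to the tier-2 store"))
    return "HTTP/1.1 200 OK"

def serve():
    response = handle_edit_request()
    print("send to client:", response)   # user-perceived save timing ends here
    for task in post_send_queue:          # now run the deferred work
        task()
    post_send_queue.clear()

serve()</pre></div>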

<h4 class="remarkup-header">Working <em>with</em> the database, instead of against it</h4>

<p>A major category of changes was improvements to database queries: for example, reducing lock contention in SQL, refactoring code to reduce the amount of work done between two write queries in the same transaction, splitting large queries into smaller ones, and avoiding the use of database master connections whenever possible.</p>

<p>These optimisations reduced the chances of queries stalling, and allowed them to complete more quickly.</p>

<h4 class="remarkup-header">Avoid synchronous cache re-computations</h4>

<p>The aforementioned work on WANObjectCache also helped a lot. Whenever we converted a feature to use this interface, we reduced the amount of blocking cache computation that happened mid-request. WANObjectCache also performs probabilistic preemptive refreshes of near-expiring values, which can prevent cache stampedes.</p>
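
<p>The general idea behind such preemptive refreshes is often called probabilistic early expiration (“XFetch”). Here is a generic Python sketch of that technique, not WANObjectCache’s exact formula:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">import math
import random

def refresh_early(remaining_ttl, recompute_cost, beta=1.0):
    """Decide whether this request should recompute a cached value early.

    The closer the value is to expiring, and the more expensive it is to
    recompute, the more likely a given request is to volunteer. Because only
    an occasional request does so, a stampede of recomputations is avoided.
    """
    return remaining_ttl &lt;= recompute_cost * beta * -math.log(1.0 - random.random())

# Example: a value with 2 seconds left that takes roughly 0.5s to recompute.
print(refresh_early(remaining_ttl=2.0, recompute_cost=0.5))</pre></div>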

<h4 class="remarkup-header">Profiling can be expensive</h4>

<p>We <a href="https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/AbuseFilter/+/211951/" class="remarkup-link remarkup-link-ext" rel="noreferrer">disabled</a> the performance profiler of the AbuseFilter extension in production. AbuseFilter allows privileged users to write rules that may prevent edits based on certain heuristics. Its profiler would record how long the rules took to inspect an edit, allowing users to optimise them. The way the profiler worked, though, added a significant slowdown to the editing process. Work to create a new profiler began later, in 2016, and has since been completed.</p>

<h4 class="remarkup-header">And more</h4>

<p><div class="phabricator-remarkup-embed-layout-right phabricator-remarkup-embed-float-right"><a href="https://phab.wmfusercontent.org/file/data/pysti4uli7xcvlcxp27s/PHID-FILE-j2vyfqsirr3i2mhqngsw/savetiming_1_1s-1_0s_September_stretch.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_52"><img src="https://phab.wmfusercontent.org/file/data/md23w2zfu4bkiwqw4sbs/PHID-FILE-o5j5cuk4d5q5h7g27dgc/preview-savetiming_1_1s-1_0s_September_stretch.png" width="220" height="70.559921414538" alt="savetiming 1_1s-1_0s September stretch.png (653×2 px, 23 KB)" /></a></div> Lots of small things, including fixing the User object cache, which existed but wasn’t working, and no longer caching values in Memcached when computing them is faster than the Memcached round-trip needed to fetch them!</p>

<p>We also improved latency of file operations by switching more LBYL-style coding patterns to EAFP-style code. Rather than checking whether a file exists, is readable, and then checking when it was last modified – do only the latter and handle any errors. This is both faster and more correct (due to <a href="https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use" class="remarkup-link remarkup-link-ext" rel="noreferrer">LBYL race conditions</a>).</p>
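
<p>In Python terms (where the LBYL and EAFP names come from), the difference looks roughly like this; the PHP file-backend code follows the same idea:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">import os

# LBYL ("look before you leap"): several filesystem checks, plus a race
# window between the checks and the actual use of the file.
def mtime_lbyl(path):
    if os.path.exists(path) and os.access(path, os.R_OK):
        return os.path.getmtime(path)
    return None

# EAFP ("easier to ask forgiveness than permission"): one call, errors handled.
def mtime_eafp(path):
    try:
        return os.path.getmtime(path)
    except OSError:
        return None</pre></div>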

<hr class="remarkup-hr" />

<h3 class="remarkup-header">So long, Sajax!</h3>

<p>Sajax was a library for invoking a subroutine on the server, and receiving its return value as JSON from client-side JavaScript. In March 2006, it was adopted in MediaWiki to power the autocomplete feature of the search input field.</p>

<p>The Sajax library had a utility for creating an XMLHttpRequest object in a cross-browser-compatible way. MediaWiki <a href="https://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg63136.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">deprecated Sajax</a> in favour of jQuery.ajax and the MediaWiki API. Yet, years later in 2015, this tiny part of Sajax remained popular in  Wikimedia&#039;s ecosystem of community-developed gadgets.</p>

<p>The legacy library was loaded by default on all Wikipedia page views for nearly a decade. During a performance inspection this year, Ori Livneh <a href="https://phabricator.wikimedia.org/T55120#1502017" class="remarkup-link" rel="noreferrer">decided</a> it was <a href="https://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg80733.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">high time</a> to <a href="https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/228768/" class="remarkup-link remarkup-link-ext" rel="noreferrer">finish</a> this migration. <a href="https://www.mail-archive.com/wikitech-l@lists.wikimedia.org/msg82007.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">Goodbye</a> Sajax!</p>

<hr class="remarkup-hr" />

<h4 class="remarkup-header">Further reading</h4>

<p>This year also saw the switch to <a href="https://blog.wikimedia.org/2015/06/12/securing-wikimedia-sites-with-https/" class="remarkup-link remarkup-link-ext" rel="noreferrer">encrypt all Wikimedia traffic with TLS by default</a>.</p>

<p>Mentioned tasks: <a href="https://phabricator.wikimedia.org/T107399" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_53"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T107399</span></span></a>, <a href="https://phabricator.wikimedia.org/T105391" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_54"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T105391</span></span></a>, <a href="https://phabricator.wikimedia.org/T109666" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_55"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T109666</span></span></a>, <a href="https://phabricator.wikimedia.org/T110858" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_56"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T110858</span></span></a>, <a href="https://phabricator.wikimedia.org/T55120" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_57"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T55120</span></span></a>.</p></div></content></entry><entry><title>Machine learning: how to undersample the wrong way</title><link href="/phame/live/7/post/123/machine_learning_how_to_undersample_the_wrong_way/" /><id>https://phabricator.wikimedia.org/phame/post/view/123/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2018-10-15T13:37:07+00:00</published><updated>2018-11-07T10:47:40+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>For the past couple of months, <a href="https://meta.wikimedia.org/wiki/Research:Study_of_performance_perception_on_Wikimedia_projects" class="remarkup-link remarkup-link-ext" rel="noreferrer">in collaboration with researchers</a>, I&#039;ve been applying machine learning to <a href="https://en.wikipedia.org/wiki/Real_user_monitoring" class="remarkup-link remarkup-link-ext" rel="noreferrer">RUM</a> metrics in order to model the <a href="https://commons.wikimedia.org/wiki/File:Wikimedia_performance_perception_survey_screenshot_in_English.png" class="remarkup-link remarkup-link-ext" rel="noreferrer">microsurvey</a> we&#039;ve been running since June on some wikis. The goal being to gain some insight into which RUM metrics matter most to real users.</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/o2wt32xdvip2omdva2pn/PHID-FILE-d6gjnnuwpmwhq7u7tivx/Sans_titre.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_58"><img src="https://phab.wmfusercontent.org/file/data/fq4bnxkssnxhjauoqfdy/PHID-FILE-6jsaqxbsex5dlircqpfg/preview-Sans_titre.png" width="220" height="142.00668896321" alt="Sans titre.png (193×299 px, 16 KB)" /></a></div></p>

<p>Having never done any machine learning before, I made a few rookie mistakes. In this post I&#039;ll explain the biggest one, which led us to believe for some time that we had built a very well-performing model.</p>

<h2 class="remarkup-header">Class imbalance</h2>

<p>The survey we&#039;re collecting user feedback with has a big <a href="http://www.chioka.in/class-imbalance-problem/" class="remarkup-link remarkup-link-ext" rel="noreferrer">class imbalance</a> issue when it comes to machine learning. A lot more people are happy about the performance than people who are unhappy (a good problem to have, for sure!). In order to build a machine learning model that works, we used a common strategy to address this: <a href="https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis" class="remarkup-link remarkup-link-ext" rel="noreferrer">undersampling</a>. The idea is that in a binary classification, if you have too many of one of the two values, you just discard the excess data for that type.</p>

<p>Sounds simple, right? In Python/pandas it looks something like this:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="n">dataset</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="n">column_prefix</span> <span class="o">+</span> <span class="s">&#039;response&#039;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">negative_responses_count</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">[</span><span class="n">column_prefix</span> <span class="o">+</span> <span class="s">&#039;response&#039;</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">()[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">negative_responses_count</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span></pre></div>

<p>Essentially we sort by response value, with the class we have the fewest of at the top, then use head() to get the first N records, where N is twice the number of negative survey responses. With this, we should end up with exactly the same number of rows for each value (negative and positive response). So far so good.</p>

<p>Then we apply our machine learning algorithm to the dataset (for example, for a binary classification of this kind, random forest is a good choice). At first the results were poor, and then we added a basic feature we had forgotten to include: time. Time of day, day of the week, day of the year, etc. When we added these, things started to work incredibly well! Surely we had discovered something groundbreaking about seasonality/time-dependence in this data. Or...</p>

<h2 class="remarkup-header">I&#039;ve made a huge mistake</h2>

<p>A critical mistake was made in the above code snippet. The original dataset&#039;s records are in chronological order. When we sort by &quot;response&quot; value, this chronological order is preserved within each sorted section of the dataset.</p>

<p>We have to perform undersampling because we have too many positive survey responses over the full timespan. We start by keeping <strong>all</strong> the negative responses, which span <strong>the full timespan</strong>. But we only keep the first N positive responses... which, due to the chronological ordering of records, come from <strong>a much shorter timespan</strong>. We end up with a dataset whose negative responses range, for example, from June 1st to October 1st, while its positive responses range only from June 1st to June 15th.</p>

<p>The reason why the model started giving excellent results when we introduced time as a feature is that it basically detected the date discrepancy in our dataset! It&#039;s pretty easy to guess a response&#039;s class just by looking at its date: if the date is later than June 15th, everything in our dataset is a negative response... Our machine learning model just started excelling at detecting our mistake :)</p>

<h2 class="remarkup-header">A simple solution</h2>

<p>The workaround for this issue is simply to pick N positive responses at random over the whole timespan when undersampling, to make sure that the dataset is consistent:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"><span class="n">dataset</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="n">column_prefix</span> <span class="o">+</span> <span class="s">&#039;response&#039;</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">negative_responses</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">negative_responses_count</span><span class="p">))</span>
<span class="n">positive_responses</span> <span class="o">=</span> <span class="n">dataset</span><span class="o">.</span><span class="n">tail</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">dataset</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">negative_responses_count</span><span class="p">))</span>
<span class="n">positive_responses</span> <span class="o">=</span> <span class="n">shuffle</span><span class="p">(</span><span class="n">positive_responses</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="n">negative_responses_count</span><span class="p">))</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">pandas</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">negative_responses</span><span class="p">,</span> <span class="n">positive_responses</span><span class="p">])</span></pre></div>

<p>This way we ensure that we&#039;re not introducing a time imbalance when working around our class imbalance.</p></div></content></entry><entry><title>Best friends forever</title><link href="/phame/live/7/post/122/best_friends_forever/" /><id>https://phabricator.wikimedia.org/phame/post/view/122/</id><author><name>Peter (Peter Hedenskog)</name></author><published>2018-10-03T09:43:53+00:00</published><updated>2018-11-04T13:38:57+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>We use both <a href="https://phabricator.wikimedia.org/phame/post/view/117/performance_testing_in_a_controlled_lab_environment_-_the_metrics/" class="remarkup-link" rel="noreferrer">synthetic</a> and <a href="https://phabricator.wikimedia.org/phame/post/view/83/measuring_wikipedia_page_load_times/" class="remarkup-link" rel="noreferrer">RUM</a> testing for Wikipedia. These two ways of testing performance are best friends and help us verify regressions. Today, we will look at two regressions where it helped us to get metrics both ways.</p>

<p>In our synthetic lab environment, we update the browser version in a controlled way. When there’s a new browser release, we wait for a new Docker container with the latest version. We stop the current tests, update the Docker container, restart the tests, and look at the metrics that we graph in Grafana. That way, we can check whether a new browser version introduced a regression.</p>

<p>Our users’ browsers usually upgrade slowly. The browser vendors usually push the browser to a percentage of users first, and then give the green light to update all of them. When we collect performance metrics, we also collect browser names and versions. That way we can see when users pick up a new browser version and if that version has any impact on our metrics. The adoption of new versions by real users takes time, and when we see a regression in our synthetic testing, it can take a couple of weeks until we see the same effect in our user metrics.</p>

<h3 class="remarkup-header">Chrome 67</h3>

<p>When we pushed Chrome 67 we noticed a regression in our first visual change synthetic testing (<a href="https://phabricator.wikimedia.org/T196242" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_74"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T196242</span></span></a>).</p>

<p>Here you can see what it looked like for our test of <a href="https://en.wikipedia.org/wiki/Facebook" class="remarkup-link remarkup-link-ext" rel="noreferrer">https://en.wikipedia.org/wiki/Facebook</a>. The blue vertical line is when we pushed the new browser version.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/v22h2kkeskdwuodpvqpn/PHID-FILE-vsca4xvj6voxftkio33p/aa.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_59"><img src="https://phab.wmfusercontent.org/file/data/v22h2kkeskdwuodpvqpn/PHID-FILE-vsca4xvj6voxftkio33p/aa.png" width="600" alt="aa.png (744×1 px, 101 KB)" /></a></div></p>

<p>Most of the pages we test were affected, but not all of them.  For our tests of the &quot;Barack Obama&quot; English Wikipedia article, it was hard to spot any change at all.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/p3dfuapz4bx6tknv6viv/PHID-FILE-3ypsos646xt7v23lfflk/bb.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_60"><img src="https://phab.wmfusercontent.org/file/data/p3dfuapz4bx6tknv6viv/PHID-FILE-3ypsos646xt7v23lfflk/bb.png" width="600" alt="bb.png (690×1 px, 98 KB)" /></a></div></p>

<p>We could only see the change on desktop. And we could verify the regression in both of our synthetic tools (<a href="https://wikitech.wikimedia.org/wiki/Performance/WebPageTest" class="remarkup-link remarkup-link-ext" rel="noreferrer">WebPageTest</a> and <a href="https://wikitech.wikimedia.org/wiki/Performance/WebPageReplay" class="remarkup-link remarkup-link-ext" rel="noreferrer">Browsertime/WebPageReplay</a>). Could it be some content that causes that regression, since it only affected some pages?</p>

<h4 class="remarkup-header">Next step</h4>

<p>When we see a regression, we always first try to rule out that it has something <a href="https://en.wikipedia.org/wiki/Law_of_Jante" class="remarkup-link remarkup-link-ext" rel="noreferrer">to do with a change we have done</a>. If the regression happens when we update the browser in our tests, it’s easy: we roll back the browser version and collect new metrics to see if the metrics go back down. And then we update to the new version again. In this case, we confirmed it was only the browser causing our first visual change metric to jump. (Not a change in our content.)</p>

<p>When we find a browser regression, we try to collect as much data as possible and file an upstream issue.  In this case it became Chromium <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=849108" class="remarkup-link remarkup-link-ext" rel="noreferrer">issue 849108</a>.</p>

<p>The next step is to see if we can find the same change in the metrics that we collect directly from users. The firstPaint metric in Chrome is similar to the first visual change metric we use in our synthetic testing. Which means that when we have enough traffic coming from Chrome 67, we should be able to see the change on first paint.</p>

<p>The conversion rate from Chrome 66 to 67 looked like this:</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/vpg4fi7z57ehbithxi4i/PHID-FILE-h4hv5ulz24kmke3k4xr7/Screen_Shot_2018-09-28_at_11.00.02_AM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_61"><img src="https://phab.wmfusercontent.org/file/data/vpg4fi7z57ehbithxi4i/PHID-FILE-h4hv5ulz24kmke3k4xr7/Screen_Shot_2018-09-28_at_11.00.02_AM.png" width="600" alt="Screen Shot 2018-09-28 at 11.00.02 AM.png (890×2 px, 399 KB)" /></a></div></p>

<p>If you look really closely, you can see that around the 15th of June we started getting enough traffic from Chrome 67 to see the effect on our metrics.</p>

<p>To see the change in Chrome, we look at the metrics we collect from all versions of Chrome and check the median and 75th percentile of first paint.</p>

<p>In the following graphs, we take the average over one day to try to minimize spikes. If you look at the right side (Chrome 67) of the graphs, you can see that it has a slightly higher first paint than the left side (Chrome 66).</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/pyphrmdqxr5ivv33dxpx/PHID-FILE-zkidc5mkrpcsntict7ve/chrome67-66.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_62"><img src="https://phab.wmfusercontent.org/file/data/pyphrmdqxr5ivv33dxpx/PHID-FILE-zkidc5mkrpcsntict7ve/chrome67-66.png" width="600" alt="chrome67-66.png (900×2 px, 168 KB)" /></a></div></p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/5wohmlj3qs7dgbuhfzls/PHID-FILE-bvdps7ttb2p7mwvdhiqh/chrome66-67.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_63"><img src="https://phab.wmfusercontent.org/file/data/5wohmlj3qs7dgbuhfzls/PHID-FILE-bvdps7ttb2p7mwvdhiqh/chrome66-67.png" width="600" alt="chrome66-67.png (958×2 px, 179 KB)" /></a></div></p>

<p>To verify the metrics, we also looked at first paint on mobile. There’s no regression there; rather, it looks like there could be a small win in first paint.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/kgc7umszyldhv3rzsd5a/PHID-FILE-ytktl6g6cmd5ozhrhk7q/Screen_Shot_2018-09-28_at_11.06.04_AM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_64"><img src="https://phab.wmfusercontent.org/file/data/kgc7umszyldhv3rzsd5a/PHID-FILE-ytktl6g6cmd5ozhrhk7q/Screen_Shot_2018-09-28_at_11.06.04_AM.png" width="600" alt="Screen Shot 2018-09-28 at 11.06.04 AM.png (906×2 px, 165 KB)" /></a></div></p>

<p>To be 100% sure that it’s nothing we introduced, we take another look at synthetic testing around the time the increase in first paint was seen for real users (the 15th of June).</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/2jyrusaium6letlk4wkc/PHID-FILE-m7ek2kebfjg5aqem5vg3/hepp222.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_65"><img src="https://phab.wmfusercontent.org/file/data/2jyrusaium6letlk4wkc/PHID-FILE-m7ek2kebfjg5aqem5vg3/hepp222.png" width="600" alt="hepp222.png (902×1 px, 193 KB)" /></a></div></p>

<p>There’s no increase in the metrics from synthetic tests at that time. This confirms it was a (small) regression in Chrome 67.</p>

<h3 class="remarkup-header">Chrome 69</h3>

<p>Some time ago, our performance alerts in Grafana fired about first paint in Chrome having increased for our users. We looked at it, and couldn’t find an exact issue that could have caused it. It looked like the metric had slowly increased over time. That type of regression is always the hardest to deal with, because it’s hard to see exactly what’s causing it.</p>

<p>We could see the regression both on desktop and mobile. It was most obvious when we checked the first paint on mobile. You can see the weekly pattern we have but the highs are getting higher and higher.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/4w7hptoocsaj4bjz3vy3/PHID-FILE-feqdcqmm3b3yydan652o/Screen_Shot_2018-10-01_at_6.42.33_PM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_66"><img src="https://phab.wmfusercontent.org/file/data/4w7hptoocsaj4bjz3vy3/PHID-FILE-feqdcqmm3b3yydan652o/Screen_Shot_2018-10-01_at_6.42.33_PM.png" width="600" alt="Screen Shot 2018-10-01 at 6.42.33 PM.png (610×1 px, 90 KB)" /></a></div></p>

<p>But we actually had the answer: When we updated to Chrome 69 in our synthetic testing a couple of weeks ago, we again saw an increase in first visual change. This time, we could see the regression on some wikis but not all of them. We’ve switched back and forth between Chrome 68 and 69 and first visual change for the Japanese wiki looked like this:</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/ym4lua2opon42qonx3ue/PHID-FILE-dpypjlbzxihylvgcyc4o/backandforth.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_67"><img src="https://phab.wmfusercontent.org/file/data/ym4lua2opon42qonx3ue/PHID-FILE-dpypjlbzxihylvgcyc4o/backandforth.png" width="600" alt="backandforth.png (1×1 px, 220 KB)" /></a></div></p>

<p>This time, the impact on first visual change seems bigger. We track this issue in <a href="https://phabricator.wikimedia.org/T203543" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_75"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T203543</span></span></a> and filed <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=887205" class="remarkup-link remarkup-link-ext" rel="noreferrer">an upstream bug</a> with Chromium.</p>

<p>Is this the same regression as we see in RUM?  Let us look again at when the majority of Chrome users switched from 68 to 69.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/w6r623ijwwssmon6bpey/PHID-FILE-57elcrfjstnfkddot5ws/Screen_Shot_2018-09-28_at_11.43.06_AM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_68"><img src="https://phab.wmfusercontent.org/file/data/w6r623ijwwssmon6bpey/PHID-FILE-57elcrfjstnfkddot5ws/Screen_Shot_2018-09-28_at_11.43.06_AM.png" width="600" alt="Screen Shot 2018-09-28 at 11.43.06 AM.png (882×2 px, 297 KB)" /></a></div></p>

<p>And then let’s go back to first paint metric. First, we look at our metric for desktop only. Around September 22nd almost all traffic was from 69, but you can also see that it was introduced in early September.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/revq7b66zd5xiszjae66/PHID-FILE-aw5dx5hfk2mlyl4tap2g/Screen_Shot_2018-10-01_at_8.13.01_PM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_69"><img src="https://phab.wmfusercontent.org/file/data/revq7b66zd5xiszjae66/PHID-FILE-aw5dx5hfk2mlyl4tap2g/Screen_Shot_2018-10-01_at_8.13.01_PM.png" width="600" alt="Screen Shot 2018-10-01 at 8.13.01 PM.png (904×2 px, 208 KB)" /></a></div><br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/4gehtuj6dfjnj7q47qxe/PHID-FILE-2jju7fkjvjqq5gkuhbte/Screen_Shot_2018-10-01_at_8.12.11_PM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_70"><img src="https://phab.wmfusercontent.org/file/data/4gehtuj6dfjnj7q47qxe/PHID-FILE-2jju7fkjvjqq5gkuhbte/Screen_Shot_2018-10-01_at_8.12.11_PM.png" width="600" alt="Screen Shot 2018-10-01 at 8.12.11 PM.png (900×2 px, 210 KB)" /></a></div></p>

<p>It looks like first paint slowly rose as Chrome 69 was rolled out, and once all our metrics were collected from 69, both the median and the 75th percentile were higher than with 68.</p>

<p>What does it look like for mobile?</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/42l6i4ztha6oo2kbilh4/PHID-FILE-puzfohix6oven42eck33/Screen_Shot_2018-09-28_at_11.44.58_AM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_71"><img src="https://phab.wmfusercontent.org/file/data/42l6i4ztha6oo2kbilh4/PHID-FILE-puzfohix6oven42eck33/Screen_Shot_2018-09-28_at_11.44.58_AM.png" width="600" alt="Screen Shot 2018-09-28 at 11.44.58 AM.png (924×2 px, 142 KB)" /></a></div><br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/fdnkkxu2fla3w6r2hv7m/PHID-FILE-py6sqyu2bbf7yq6nfott/Screen_Shot_2018-09-28_at_11.44.44_AM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_72"><img src="https://phab.wmfusercontent.org/file/data/fdnkkxu2fla3w6r2hv7m/PHID-FILE-py6sqyu2bbf7yq6nfott/Screen_Shot_2018-09-28_at_11.44.44_AM.png" width="600" alt="Screen Shot 2018-09-28 at 11.44.44 AM.png (888×2 px, 133 KB)" /></a></div></p>

<p>We see the same pattern here. Let us check our synthetic testing at the same time, to see if we could have introduced a code change that affected first visual change. Our metrics on mobile are even more stable than on desktop. We look at the Swedish Wikipedia, because we didn’t deploy any change on that test server during this period.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/og7rbu4u4ei7hqpu3inc/PHID-FILE-lwgijr2js5wqixpfhdt6/Screen_Shot_2018-09-28_at_12.36.56_PM.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_73"><img src="https://phab.wmfusercontent.org/file/data/og7rbu4u4ei7hqpu3inc/PHID-FILE-lwgijr2js5wqixpfhdt6/Screen_Shot_2018-09-28_at_12.36.56_PM.png" width="600" alt="Screen Shot 2018-09-28 at 12.36.56 PM.png (998×1 px, 271 KB)" /></a></div></p>

<p>No regression there. It looks like this could also be a performance regression in Chrome.</p>

<h3 class="remarkup-header">Summary</h3>

<p>Working with both synthetic metrics and metrics from real users helps us confirm issues. In this case, it helped us to find two browser regressions that impact our users. We hope that we can get help from the Chromium team to resolve these issues.</p></div></content></entry><entry><title>Performance testing in a controlled lab environment - the metrics</title><link href="/phame/live/7/post/117/performance_testing_in_a_controlled_lab_environment_-_the_metrics/" /><id>https://phabricator.wikimedia.org/phame/post/view/117/</id><author><name>Peter (Peter Hedenskog)</name></author><published>2018-09-21T07:49:33+00:00</published><updated>2019-02-14T00:25:52+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>One of the Performance Team responsibilities at Wikimedia is to keep track of Wikipedia&#039;s performance. Why is performance important for us? In our case it is easy: we have so many users that if we have a performance regression, we are really affecting people&#039;s lives. Maybe you remember our hiring tweet from a couple of years ago?</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/xauocneqxaspxkejesjx/PHID-FILE-ownzgk6z2brwj6mirxtb/wikipedia.PNG" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_76"><img src="https://phab.wmfusercontent.org/file/data/xauocneqxaspxkejesjx/PHID-FILE-ownzgk6z2brwj6mirxtb/wikipedia.PNG" width="400" alt="wikipedia.PNG (1×640 px, 93 KB)" /></a></div></p>

<p>If we are slow, we waste users’ time. And we don’t want to do that. That&#039;s why we are really serious about the performance of Wikipedia.</p>

<h3 class="remarkup-header">Performance metrics</h3>

<p>Timo <a href="https://phabricator.wikimedia.org/phame/live/7/post/83/measuring_wikipedia_page_load_times/" class="remarkup-link" rel="noreferrer">told us</a> that there are two ways of collecting web performance metrics: directly from our users, called real user measurements (RUM), or in a controlled lab environment (synthetic testing).</p>

<p>Getting metrics from real users is good because they are close to what people really experience. The problem is that:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Today&#039;s browsers have limited ways of giving us metrics that tell us what the user is experiencing. A couple of browsers have a metric called First Paint (when something is first painted on the screen). But the rest of the metrics are more technical. They tell us how the browser is doing, not what the user is experiencing. Browser vendors are working on this but, at the moment, most performance APIs are focused on technical metrics. And we’re more interested in what our users are experiencing.</li>
</ul>

<ul class="remarkup-list">
<li class="remarkup-list-item">Metrics from real users have a lot of variation because the users have different conditions: network latency, operating system, browser version, CPU and more. If something changes, how do we know the main reason? Is it our code? Is it something else?</li>
</ul>

<ul class="remarkup-list">
<li class="remarkup-list-item">There is a lot of noise in the data we collect from real users. To catch a performance regression, it needs to be big enough and affect many users to be visible.</li>
</ul>

<p>That’s why we also test in a lab environment.</p>

<h3 class="remarkup-header">Performance testing in a lab</h3>

<p>Testing in a lab means that we run a desktop (or mobile) device in an isolated environment that we try to keep as stable as possible. That way we hope to pick up small performance regressions and to know what caused them.</p>

<p>In a lab environment we have more control, and we love control when we want to measure things! We want the conditions between tests to be as similar as possible. If we can control the environment, we can find regressions introduced by our code changes.</p>

<p>Testing in a lab environment helps us with:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">Collecting metrics that are more related to user experience than the technical metrics the browsers provide. In this post we will focus on the specific metrics we get from the lab environment.</li>
</ul>

<ul class="remarkup-list">
<li class="remarkup-list-item">In a controlled environment that typically provides consistent test results, it is easier to detect regressions. It’s easier to spot smaller regressions than with RUM.</li>
</ul>

<ul class="remarkup-list">
<li class="remarkup-list-item">Lab testing and RUM are best friends: our synthetic testing improves our confidence in RUM and vice versa. If we see a regression in both types of measurements, we know for sure that it&#039;s a real regression.</li>
</ul>

<ul class="remarkup-list">
<li class="remarkup-list-item">We can see changes very quickly when testing in a lab setting, because the only variable that is changing is the code. Being confident in RUM usually takes more time: you need to have more data to see the change.</li>
</ul>

<ul class="remarkup-list">
<li class="remarkup-list-item">In a lab setting, we are also able to control when browser versions are updated to new versions, allowing us to detect when browser releases impact user experience. Our users’ browsers auto updates. A new browser version can affect the performance.</li>
</ul>

<p>But everything isn’t perfect when testing in a lab: we only test a small usage group (a specific browser version, a specific connection type, one operating system, a couple of URLs). We will miss out on a lot of user scenarios. That’s the big disadvantage of testing in a lab, and that’s why it is also important to collect metrics from real users.</p>

<h3 class="remarkup-header">Metrics in our lab</h3>

<p>One of the things I like about testing in a lab is that we can get more user-centered metrics than we can get directly from the browser. We mostly focus on visual metrics.</p>

<h4 class="remarkup-header">Visual metrics</h4>

<p>Visual metrics describe what happens within the viewport of the browser, the part of the page the user can see. We collect these metrics by recording a video of the browser screen, then analyzing the video and calculating the metrics from it.</p>
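
<p>As a rough sketch of how such metrics can be derived from a video, tools like WebPageTest estimate visual completeness by comparing each frame with the final frame, for example via colour histograms. A toy Python version, with frames reduced to flat lists of pixel values, might look like this:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">from collections import Counter

def histogram_distance(frame_a, frame_b):
    """Sum of absolute differences between the two frames' colour histograms."""
    ha, hb = Counter(frame_a), Counter(frame_b)
    return sum(abs(ha[v] - hb[v]) for v in set(ha) | set(hb))

def visual_progress(frames):
    """Return (frame index, visual completeness 0..1) for each video frame."""
    first, last = frames[0], frames[-1]
    total = histogram_distance(first, last) or 1
    return [(i, max(0.0, 1 - histogram_distance(f, last) / total))
            for i, f in enumerate(frames)]

# Tiny grayscale example: the page progressively fills with white pixels.
frames = [[0, 0, 0, 0], [0, 255, 0, 0], [0, 255, 255, 0], [0, 255, 255, 255]]
print(visual_progress(frames))  # completeness rises from 0.0 to 1.0</pre></div>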

<p>What’s good about visual metrics is that they are easy to see, understand, and relate to. They are the best metrics we have today for knowing what the user is experiencing.</p>

<p>However, visual metrics don&#039;t tell the full story: by focusing only on visuals, we miss when the page “feels” ready. What do we mean by &quot;feel&quot;? JavaScript that gets executed after the screen is painted can make the page feel slow or laggy: you try to interact with the page but nothing happens. Today we don’t have a good way to measure that “feeling”, but there have been different attempts in the performance industry to fix that. There is ongoing work on metrics like <a href="https://github.com/WPO-Foundation/webpagetest/blob/master/docs/Metrics/TimeToInteractive.md#time-to-interactive-tti" class="remarkup-link remarkup-link-ext" rel="noreferrer">Time To Interactive</a> and other interaction metrics that try to determine when it is possible for the user to interact with the page. But at the moment no browser supports them natively and, in my opinion, these metrics are not yet mature enough to use.</p>

<h5 class="remarkup-header">First Visual Change</h5>

<p>First visual change is when something is first painted within the viewport of the browser screen. For Wikipedia on desktop this mostly means going from a completely blank screen to something like this:</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/kclcvmoslc3qxfeinz4l/PHID-FILE-g2fz3dsyisn32eld6rmh/obama-desktop.jpg" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_77"><img src="https://phab.wmfusercontent.org/file/data/kclcvmoslc3qxfeinz4l/PHID-FILE-g2fz3dsyisn32eld6rmh/obama-desktop.jpg" width="400" alt="obama-desktop.jpg (1×2 px, 284 KB)" /></a></div></p>

<p>Depending on your internet connection the first visual change may not include the image. Other than that, this is what the first change looks like for most users. If you have another case, please submit an <a href="https://phabricator.wikimedia.org/maniphest/task/edit/form/1/?projectPHIDs=Performance-Team" class="remarkup-link" rel="noreferrer">issue in Phabricator</a> with screenshots and your setup so we can reproduce and add more test cases.</p>

<p>On mobile the first visual change looks something like this:</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/5tqxopt7tvqtvweu7ptj/PHID-FILE-kx5wd3w72u4ol35kony7/obama-mobile.jpg" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_78"><img src="https://phab.wmfusercontent.org/file/data/5tqxopt7tvqtvweu7ptj/PHID-FILE-kx5wd3w72u4ol35kony7/obama-mobile.jpg" width="400" alt="obama-mobile.jpg (1×870 px, 93 KB)" /></a></div></p>

<p>Showing content (text/images) early is important, since the earlier you can see the text as a reader, the sooner you can start to read and get the information you want.</p>

<p>First visual change also correlates to the RUM metric first paint. We can see that the first visual change and first paint happen at almost the same time. That means that if we see a change in first paint in RUM, we will also see the change in first visual change in synthetic testing.</p>

<h5 class="remarkup-header">Speed Index</h5>

<p><a href="https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index" class="remarkup-link remarkup-link-ext" rel="noreferrer">Speed Index</a> was invented by Patrick Meenan, the creator of WebPageTest:</p>

<blockquote><p>“The Speed Index is the average time at which visible parts of the page are displayed. It is expressed in milliseconds and dependent on size of the viewport.”</p></blockquote>

<p>The idea with Speed Index is to capture how quickly the content within the browser viewport becomes visually complete. Speed Index will be lower if the entire page renders more quickly, but it will also be lower if partial content renders sooner. If two pages finish rendering at the same time, the page that started to render first will have the lowest Speed Index.</p>
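
<p>In practice, Speed Index can be computed from (time, visual completeness) samples by integrating the “incompleteness” over time. A small Python sketch, which assumes completeness stays constant between consecutive samples, looks like this:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">def speed_index(samples):
    """Speed Index from (time in ms, visual completeness 0..1) samples,
    assuming completeness stays constant between consecutive samples."""
    si = 0.0
    for (t0, c0), (t1, _c1) in zip(samples, samples[1:]):
        si += (1.0 - c0) * (t1 - t0)
    return si

# Example: rendering starts at 500 ms, reaches 90% at 1000 ms, 100% at 1500 ms.
frames = [(0, 0.0), (500, 0.4), (1000, 0.9), (1500, 1.0)]
print(speed_index(frames))  # 500*1.0 + 500*0.6 + 500*0.1 = 850.0</pre></div>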

<p>We use Speed Index to know when the page looks ready for the user. However, the page can still download assets and run JavaScript, so it could be that the page looks ready but doesn’t feel ready (it’s not responsive if you try to click on a link).</p>

<h5 class="remarkup-header">Last Visual Change</h5>

<p>Last visual change is when everything within the viewport is ready on the screen. It could be images coming in late or JavaScript in one extension changing the layout of the already painted screen.</p>

<p>Here’s an example video. Look to the right of the screen and you will see the map, the sound player and that small coordinates image coming in really late.</p>

<p><div class="embedded-commons-video"><iframe width="650" height="400" style="margin: 1em auto; border: 0px;" src="https://commons.wikimedia.org/wiki/File:Sweden_article_on_enwiki_loaded_in_Chrome_using_Browsertime,_recorded_using_FFMpeg.webm?embedplayer=yes" frameborder="0"></iframe></div></p>

<p>We don&#039;t act on last visual change today because it is really dependent on what page we test, what browser we use to test the page, and other factors. But it is an important metric: if the page starts to change when you want to interact with it, you will not know if the page is ready to use or not.</p>

<h5 class="remarkup-header">Visual Complete 85/95/99</h5>

<p>We also collect metrics when the page is 85, 95 and 99% complete.  The intent of these metrics is to account for content-related factors that result in changes to overall page load times. For example, when Wikipedia is running fundraising campaigns, there is a donation banner that appears on most pages on the site.  These donation banners are intentionally loaded after most of the content has loaded, but they do result in the last visual change being pushed back quite a bit.  By measuring the time that it takes for the page to be 85% complete instead, we avoid the variation that this introduces.</p>

<h5 class="remarkup-header">Visual Elements</h5>

<p>The last of our visual metrics is the ability to measure when specific elements are painted on the screen. We can answer questions like: when is the largest image painted? When is the logo? We have started to collect these metrics in our tools but haven&#039;t yet started to act on them.</p>

<h4 class="remarkup-header">CPU time spent metrics</h4>

<p>Another type of metric that we use in the lab is CPU time spent in the browser. Today we can only get that from Chrome. We configure and collect the event trace log from Chrome and categorise the events at a high level.</p>

<p>We can then see time spent on painting, rendering, scripting and loading. If we get a regression, we can then go down to a lower event level and use that to better understand what is actually happening.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/zron5fgy275omwp7rif3/PHID-FILE-lzdljllmxdfsjinpfyec/jump-in-rendering.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_79"><img src="https://phab.wmfusercontent.org/file/data/zron5fgy275omwp7rif3/PHID-FILE-lzdljllmxdfsjinpfyec/jump-in-rendering.png" width="400" alt="jump-in-rendering.png (926×1 px, 80 KB)" /></a></div></p>

<h4 class="remarkup-header">Deviation</h4>

<p>One important meta metric that we also collect is the standard deviation of each metric that we collect. That helps us know how stable the metrics we have are and if we need to tune anything in our setup. It also helps us understand if we have certain pages that are extra hard to measure (that have a lot of variation).  This is something that I’ll talk about more in my next post, when we discuss the ways that we control our synthetic testing environment.</p>

<h4 class="remarkup-header">Page snapshots</h4>

<p>Our synthetic testing also helps us know how our pages were constructed at a given time. Every time we measure a page, we collect the number of requests, the size and type of each response, and how the browser downloaded the responses. That helps when we find a regression: we can compare the before and after (what did the page look like before the regression?).</p>

<p>We also collect screenshots and videos of what the page looked like at the time we measured the page.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/2mawkp46awx3evqib4gg/PHID-FILE-gnzhbe4mtgh57tktztbn/waterfall.jpg" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_80"><img src="https://phab.wmfusercontent.org/file/data/2mawkp46awx3evqib4gg/PHID-FILE-gnzhbe4mtgh57tktztbn/waterfall.jpg" width="400" alt="waterfall.jpg (1×1 px, 165 KB)" /></a></div></p>

<h3 class="remarkup-header">Summary</h3>

<p>We focus on visual metrics and CPU time spent metrics in our synthetic testing. The visual metrics give us a better feel for what the user is experiencing than most of the metrics we collect from real users. Our synthetic testing also gives us a snapshot of what the page looked like at the moment we measured it. The problem with synthetic measuring is that we try out only a small portion of users’ different setups.</p>

<p>In the next blog post I will talk about our technical setup, problems we have had, and strategies we are using to get stable metrics.</p></div></content></entry><entry><title>Mobile web performance: the importance of the device</title><link href="/phame/live/7/post/109/mobile_web_performance_the_importance_of_the_device/" /><id>https://phabricator.wikimedia.org/phame/post/view/109/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2018-06-22T14:19:10+00:00</published><updated>2020-03-21T21:24:37+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>This week at our team offsite in Dublin, I looked at our performance data from an angle we haven&#039;t explored before: mobile device type. Most mobile devices expose their make and model in the <a href="https://en.wikipedia.org/wiki/User_agent#User_agent_identification" class="remarkup-link remarkup-link-ext" rel="noreferrer">User Agent</a> string, which allows us to look at data for a particular type of device. <a href="https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging" class="remarkup-link remarkup-link-ext" rel="noreferrer">As per our data retention guidelines, we only keep user agent information for 90 days</a>, but that&#039;s already plenty of data to draw conclusions from.</p>

<p>I looked at the top 10 mobile devices accessing our mobile sites, per country, for the past week. One country in particular, India, had an interesting set of top 10 devices that included two models from different hardware generations. The Samsung SM-J200G, commercially known as the <a href="https://en.wikipedia.org/wiki/Samsung_Galaxy_J2_(2017)" class="remarkup-link remarkup-link-ext" rel="noreferrer">Samsung Galaxy J2</a>, was the 5th most common mobile device accessing our mobile sites. And the Samsung SM-G610F, also known as the <a href="https://en.wikipedia.org/wiki/Samsung_Galaxy_J7_Prime" class="remarkup-link remarkup-link-ext" rel="noreferrer">Samsung Galaxy J7 Prime</a>, was the 2nd most common. The hardware of the more recent handset is considerably more powerful, with 3 times the RAM, a 23% faster CPU clock, and twice as many CPU cores as the older model.</p>

<p>Being in the top 10 for that country, both devices get a lot of traffic in India, which means a lot of performance <a href="https://en.wikipedia.org/wiki/Real_user_monitoring" class="remarkup-link remarkup-link-ext" rel="noreferrer">Real User Monitoring</a> data collected from real clients to work with.</p>

<p>With the <tt class="remarkup-monospaced">J7 Prime</tt> retail price in India currently being double the <tt class="remarkup-monospaced">J2</tt> retail price, one might wonder if users who use the cheaper phone also use a cheaper, slower, internet provider.</p>

<p>Thanks to the <a href="https://developer.mozilla.org/en-US/docs/Web/API/NetworkInformation/effectiveType" class="remarkup-link remarkup-link-ext" rel="noreferrer">Network Information API</a>, which we <a href="https://phabricator.wikimedia.org/T182361" class="remarkup-link" rel="noreferrer">recently added to the performance data we collect</a>, we are able to tell.</p>

<p>Looking at Chrome Mobile only, for the sake of having a consistent definition of the effectiveType buckets, we get:</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>effectiveType</th><th> <tt class="remarkup-monospaced">J2</tt></th><th><tt class="remarkup-monospaced">J7 Prime</tt></th></tr>
<tr><td>slow-2g</td><td>0.5%</td><td>1.1%</td></tr>
<tr><td>2g</td><td>0.8%</td><td>0.7%</td></tr>
<tr><td>3g</td><td>27%</td><td>28%</td></tr>
<tr><td>4g</td><td>71.5%</td><td>70.2%</td></tr>
<tr></tr>
</table></div>

<p>These breakdowns are extremely similar, which strongly suggests that users of these two phone models in India actually experience the same internet connectivity quality. This is very interesting, because it gives us the ability to compare the performance of these two devices from different hardware generations, in the real world, with connectivity quality as a whole that looks almost identical. And similar latency, since they&#039;re connecting to our data centers from the same country.</p>

<p>What does <a href="https://developer.mozilla.org/en-US/docs/Web/API/PerformancePaintTiming" class="remarkup-link remarkup-link-ext" rel="noreferrer">firstPaint</a> look like for these users, then?</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>Device</th><th>Sample size</th><th>Median</th><th>p90</th><th>p95</th><th>p99</th></tr>
<tr><td><tt class="remarkup-monospaced">J2</tt></td><td>1226</td><td>1842</td><td>4769</td><td>7704</td><td>15957</td></tr>
<tr><td><tt class="remarkup-monospaced">J7 Prime</tt></td><td>1798</td><td>1082</td><td>2811</td><td>5076</td><td>12136</td></tr>
<tr><td>difference</td><td></td><td>-41.3%</td><td>-41.1%</td><td>-34.2%</td><td>-24%</td></tr>
<tr></tr>
</table></div>

<p>And what about <a href="https://developer.mozilla.org/en-US/docs/Web/API/PerformanceTiming/loadEventEnd" class="remarkup-link remarkup-link-ext" rel="noreferrer">loadEventEnd</a>?</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><th>Device</th><th>Sample size</th><th>Median</th><th>p90</th><th>p95</th><th>p99</th></tr>
<tr><td><tt class="remarkup-monospaced">J2</tt></td><td>1226</td><td>3078</td><td>9813</td><td>14072</td><td>29240</td></tr>
<tr><td><tt class="remarkup-monospaced">J7 Prime</tt></td><td>1798</td><td>1821</td><td>5635</td><td>9847</td><td>28949</td></tr>
<tr><td>difference</td><td></td><td>-40.9%</td><td>-42.6%</td><td>-30.1%</td><td>-1.1%</td></tr>
<tr></tr>
</table></div>

<p>Across the board, the difference is huge, even for metrics like loadEventEnd, where one might think that download speed would be an equalizer, particularly since we serve some heavy pages when articles are long. OS version might play a part in addition to hardware, but in practice we see that older Android devices tend to stick to the OS version they were shipped with, which means that those two factors are tied together. For example, worldwide for the past week, 100% of <tt class="remarkup-monospaced">J2</tt> phones run the Android version they were shipped with (5.1).</p>

<p>These results show that device generation has a huge impact on the real performance experienced by users. Across the globe, users are upgrading their devices over time. This phenomenon means that the performance metrics we measure directly on sampled users with RUM should improve over time, by virtue of people getting more powerful devices on average. This is an important factor to keep in mind when measuring the effect of our own performance optimizations. And when the median of the RUM metrics stays stable over a long period of time, it might be that our performance is actually worsening, and that degradation is being masked by device and network improvements across the board.</p>

<p>Given the eye-opening results of this small study, getting a better grasp on the pace of improvement of the environment (device generations, network) looks like a necessity to understand and validate our impact on the evolution of RUM metrics.</p></div></content></entry><entry><title>Thumbor support for private wikis deployed</title><link href="/phame/live/7/post/86/thumbor_support_for_private_wikis_deployed/" /><id>https://phabricator.wikimedia.org/phame/post/view/86/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2018-02-22T10:34:43+00:00</published><updated>2018-02-22T16:36:28+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Yesterday we deployed <a href="https://wikitech.wikimedia.org/wiki/Thumbor" class="remarkup-link remarkup-link-ext" rel="noreferrer">Thumbor</a> support for Wikimedia-hosted private wikis. While 99.9% of our traffic is for public-facing wikis, the Wikimedia Foundation hosts a number of private MediaWiki instances on the same infrastructure. Those wikis facilitate work for various groups in the movement, from community-run projects like <a href="https://meta.wikimedia.org/wiki/OTRS" class="remarkup-link remarkup-link-ext" rel="noreferrer">OTRS</a>, to local chapters, staff or the board. They&#039;re essential to the Wikimedia Movement, but by being private they&#039;re an architectural <em>special case</em>.</p>

<p>When we migrated all public thumbnail traffic to using Thumbor as the rendering backend last June, it would have been easy to claim the job done and move on to something else, turning a blind eye to the special case of private wikis. But their different setup meant that they were still using the MediaWiki-based thumbnailing cluster. It was a clear waste of resources to have a whole (reduced, but still multi-machine) cluster dedicated to a special case representing so little traffic. And more importantly, it meant that for tasks like security work or software upgrades, we would have two clusters to care about for image processing: the new Thumbor one and the legacy MediaWiki image scalers, with very different testing involved for each.</p>

<p>What makes thumbnailing different for private wikis is that, like any content on them, images are meant to be viewed only by people with access to those wikis. For public wikis, authentication isn&#039;t required, and that&#039;s what lets us have a more streamlined stack that doesn&#039;t hit MediaWiki. Public wiki thumbnails are highly cached in Varnish. For private wikis, MediaWiki&#039;s authentication acts as the gatekeeper to let a client view a thumbnail. Varnish doesn&#039;t cache the thumbnails of private wikis, and merely forwards the request to MediaWiki.</p>

<p>With the new system deployed yesterday, when MediaWiki receives such requests for a new thumbnail on a private wiki, instead of rendering it like it used to, it proxies the request to the same Thumbor cluster used by public wikis, which takes care of the rendering. Some additional gatekeeping is in place in Thumbor to ensure that requests coming from the public wiki pipeline cannot access images that belong to private wikis. Essentially, rendering is now centralized on the single Thumbor cluster, which takes care of both worlds, while still keeping Thumbor decoupled from MediaWiki authentication (since for security reasons, we don&#039;t want Thumbor to interact with MediaWiki databases).</p>

<p>Bar any unforeseen issues while we keep an eye on potential bugs in the coming months, we will most likely retire the MediaWiki-based image scaling cluster this year, therefore truly concluding the migration of all our thumbnail rendering across our entire infrastructure to Thumbor.</p>

<p>Sometimes it takes a lot of extra work to tackle those special cases, which can feel like a chore after having switched 99.9% of the traffic already. But the cost of keeping a legacy system running for a special case cannot be overlooked. Beyond keeping a cluster of mostly idle machines in two data centers, the duplicated work of maintaining things is also expensive and never really quantified. Reaching true completion and decommissioning a legacy cluster feels great, though; it&#039;s really worth putting in the extra effort!</p></div></content></entry><entry><title>Measuring Wikipedia page load times</title><link href="/phame/live/7/post/83/measuring_wikipedia_page_load_times/" /><id>https://phabricator.wikimedia.org/phame/post/view/83/</id><author><name>Krinkle (Timo Tijhof)</name></author><published>2018-01-09T18:25:45+00:00</published><updated>2019-09-11T00:04:48+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>This post shows how we measure and interpret load times on Wikipedia. It also explains what real-user metrics are, and how percentiles work.</p>

<h3 class="remarkup-header">Navigation Timing</h3>

<p>When a browser loads a page, the page can include program code (JavaScript). This program will run inside the browser, alongside the page. This makes it possible for a page to become dynamic (more than static text and images). When you search on Wikipedia.org, the suggestions that appear are made with JavaScript.</p>

<p>Browsers allow JavaScript to access some internal systems. One such system is Navigation Timing, which tracks how long each step takes. For example:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">How long to establish a connection to the server?</li>
<li class="remarkup-list-item">When did the response from the server start arriving?</li>
<li class="remarkup-list-item">When did the browser finish loading the page?</li>
</ul>

<h3 class="remarkup-header">Where to measure: Real-user and synthetic</h3>

<p>There are two ways to measure performance: Real user monitoring, and synthetic testing. Both play an important role in understanding performance, and in detecting changes.</p>

<p>Synthetic testing can give high confidence in change detection. To detect changes, we use an automated mechanism to continually load a page and extract a result (e.g. load time). When there is a difference between results, it likely means that our website changed. This assumes other factors in the test environment remain constant: network latency, operating system, browser version, and so on.</p>
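
<p>As a minimal sketch of that idea (illustrative only, not our actual tooling), change detection can be as simple as comparing the median of fresh runs against a stored baseline; <tt class="remarkup-monospaced">run_test()</tt> is a hypothetical helper that loads the page in a fixed environment and returns a load time:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Illustrative change detection for synthetic testing (not our actual tooling).
import statistics

def detect_change(run_test, baseline_ms, runs=11, threshold_ms=50):
    # Collect fresh runs and compare their median against the stored baseline.
    current_ms = statistics.median([run_test() for _ in range(runs)])
    delta = current_ms - baseline_ms
    if abs(delta) &gt; threshold_ms:
        print(f"possible change: median moved by {delta:.0f} ms")
    return current_ms</pre></div>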

<p>This is good for understanding relative change. But synthetic testing does not measure the performance as perceived by users. For that, we need to collect measurements from the user’s browser.</p>

<p>Our JavaScript code reads the measurements from Navigation Timing, and sends them back to Wikipedia.org. This is real-user monitoring.</p>

<h3 class="remarkup-header">How to measure: Percentiles</h3>

<p>Imagine 9 users each send a request: 5 users get a result in 5ms, 3 users get a result in 70ms, and for one user the result took 560ms. The average is 88ms. But, the average does not match anyone’s real experience. Let’s explore percentiles!</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/siw3g2fygkjxysmeu4gv/PHID-FILE-icv5lj6kc2pldcbl23hz/figure-1-percentiles.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_81"><img src="https://phab.wmfusercontent.org/file/data/siw3g2fygkjxysmeu4gv/PHID-FILE-icv5lj6kc2pldcbl23hz/figure-1-percentiles.png" height="100" alt="Diagram showing 9 labels: 5ms, 5ms, 5ms, 5ms, 5ms, 70ms, 70ms, 70ms, and 560ms." /></a></div></p>

<p>The first number after the lower half (or middle) is the median (or <em>50th percentile</em>). Here, the median is 5ms. The first number after the lower 75% is 70ms (<em>75th percentile</em>). We can say that &quot;for 75% of users, the service responded within 70ms&quot;. That’s more useful.</p>
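
<p>For readers who want to check the arithmetic, here is a small sketch (not production code) that reproduces the example using the nearest-rank method:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># The nine response times from the example, in milliseconds.
import statistics

times = sorted([5, 5, 5, 5, 5, 70, 70, 70, 560])

mean = statistics.mean(times)               # about 88 ms
median = times[len(times) // 2]             # 5 ms, the middle value
# Nearest-rank 75th percentile: the first value after the lowest 75%.
p75 = times[-(-75 * len(times) // 100) - 1] # 70 ms

print(mean, median, p75)</pre></div>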

<p>When working on a service used by millions, we focus on the 99th percentile and the highest value (100th percentile). Using medians, or percentiles lower than 99%, would exclude many users. A problem with 1% of requests is a serious problem. To understand why, it is important to understand that 1% of requests does not mean 1% of pageviews, or even 1% of users.</p>

<p>A typical Wikipedia pageview makes 20 requests to the server (1 document, 3 stylesheets, 4 scripts, 12 images). A typical user views 3 pages during their session (on average).</p>

<p>This means our problem with 1% of requests could affect 20% of pageviews (<tt class="remarkup-monospaced">20 requests x 1% = 20% = ⅕</tt>). And 60% of users (<tt class="remarkup-monospaced">3 pages x 20 requests x 1% = 60% = ⅗</tt>). Even worse, over a long period of time, it is most likely that every user will experience the problem at least once. This is like rolling dice in a game. With a 16% (⅙) chance of rolling a six, if everyone keeps rolling, everyone should get a six eventually.</p>
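
<p>The multiplications above are simple upper bounds that assume the affected requests never overlap. Treating each request as an independent 1% risk (a back-of-the-envelope sketch, not a precise model of our traffic) gives slightly lower numbers, but the conclusion is the same:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Probability that at least one request in a pageview or session is affected,
# assuming each request independently has a 1% chance of hitting the problem.
requests_per_pageview = 20
pageviews_per_session = 3

p_pageview = 1 - 0.99 ** requests_per_pageview  # about 18% of pageviews
p_session = 1 - 0.99 ** (requests_per_pageview * pageviews_per_session)  # about 45% of sessions

print(f"{p_pageview:.0%} of pageviews, {p_session:.0%} of sessions")</pre></div>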

<h3 class="remarkup-header">Real-user variables</h3>

<p>The previous section focussed on performance as measured inside our servers. These measurements start when our servers receive a request, and end once we have sent a response. This is <em>back-end</em> performance. In this context, our servers are the <em>back-end</em>, and the user’s device is the <em>front-end</em>.</p>

<p>It takes time for the request to travel from the user’s device to our systems (through cellular or WiFi radio waves, and through wires.) It also takes time for our response to travel back over similar networks to the user’s device. Once there, it takes even more time for the device’s operating system and browser to process and display the information. Measuring this is part of front-end performance.</p>

<p>Differences in back-end performance may affect all users. But, differences in front-end performance are influenced by factors we don’t control. Such as network quality, device hardware capability, browser, browser version, and more.</p>

<p>Even when we make no changes, the front-end measurements do change. Possible causes:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item"><strong>Network</strong>. ISPs and mobile network carriers can make changes that affect network performance. Existing users may switch carriers. New users come online with a different choice distribution of carrier than current users.</li>
<li class="remarkup-list-item"><strong>Device</strong>. Operating system and browser vendors release upgrades that may affect page load performance. Existing users may switch browsers. New users may choose browsers or devices differently than current users.</li>
<li class="remarkup-list-item"><strong>Content change</strong>. Especially for Wikipedia, the composition of an article may change at any moment.</li>
<li class="remarkup-list-item"><strong>Content choice</strong>. Trends in news or social media may cause a shift towards different (kinds of) pages.</li>
<li class="remarkup-list-item"><strong>Device choice</strong>. Users that own multiple devices may choose a different device to view the (same) content.</li>
</ul>

<p>The most likely cause for a sudden change in metrics is ourselves. Given our scale, the above factors usually change only for a small number of users at once. Or the change might happen slowly.</p>

<p>Yet, sometimes these external factors do cause a sudden change in metrics.</p>

<h3 class="remarkup-header">Case in point: Mobile Safari 9</h3>

<p>Shortly after Apple released iOS 9 (in 2015), our global measurements were higher than before. We found this was due to Mobile Safari 9 introducing support for Navigation Timing.</p>

<p>Before this event, our metrics only represented mobile users on Android. With iOS 9, our data increased its scope to include Mobile Safari.</p>

<p>iOS 9, or the networks of iOS 9 users, were not significantly faster or slower than Android’s. The iOS upgrade affected our metrics because we now include an extra 15% of users – those on Mobile Safari.</p>

<p>Where desktop latency is around 330ms, mobile latency is around 520ms. Having more metrics from mobile skewed the global metrics toward that category.</p>

<div class="remarkup-table-wrap"><table class="remarkup-table">
<tr><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/oilmhyhrdbtjqehaze2q/PHID-FILE-3lxaj6g7pshsh2yvd4ug/blog-image-01.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_82"><img src="https://phab.wmfusercontent.org/file/data/oilmhyhrdbtjqehaze2q/PHID-FILE-3lxaj6g7pshsh2yvd4ug/blog-image-01.png" height="154" alt="Line graph for responseStart metric from desktop pageviews. Values range from 250ms to 450ms. Averaging around 330ms." /></a></div></td><td><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/cimqlkthg53uq4c7e7yw/PHID-FILE-qv67annefhzfalj46cn5/blog-image-02.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_83"><img src="https://phab.wmfusercontent.org/file/data/cimqlkthg53uq4c7e7yw/PHID-FILE-qv67annefhzfalj46cn5/blog-image-02.png" height="156" alt="Line graph for responseStart metric from mobile pageviews. Values range from 350ms to 700ms. Averaging around 520ms." /></a></div></td></tr>
<tr></tr>
</table></div>

<p>The above graphs plot the &quot;75th percentile&quot; of responseStart for desktop and mobile (from November 2015). We combine these metrics into one data point for each minute. The graphs show data for one month. There is only enough space on the screen to have each point represent 3 hours. This works by taking the mean of the per-minute values within each 3-hour block. While this provides a rough impression, these graphs do not show the true 75th percentile for November 2015. The next section explains why.</p>

<h3 class="remarkup-header">Average of percentiles</h3>

<p>Opinions vary on how bad it is to take the average of percentiles over time. But one thing is clear: The average of many 1-minute percentiles is not the percentile for those minutes. Every minute is different, and the number of values also varies each minute. To get the percentile for one hour, we need all values from that hour, not the percentile summary from each minute.</p>

<p>Below is an example with values from three minutes of time. Each value is the response time for one request. Within each minute, the values sort from low to high.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/porgzz4pytpmrn3xexh6/PHID-FILE-2miv6vsothoyeluj4qg6/figure-2.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_84"><img src="https://phab.wmfusercontent.org/file/data/porgzz4pytpmrn3xexh6/PHID-FILE-2miv6vsothoyeluj4qg6/figure-2.png" height="512" alt="Diagram with four sections. Section One is for the minute 08:00 to 08:01, it has nine values with the middle value of 5ms marked as the median. Section Two is for 08:01 to 08:02 and contains five values, the median is 560ms. Section Three is 08:02 to 08:03, contains five values, the median of Section Three is 70ms. The last section, Section Four, is the combined diagram from 08:00 to 08:03 showing all nineteen values. The median is 70ms." /></a></div></p>

<p>The average of the three separate medians is 211ms. This is the result of <tt class="remarkup-monospaced">(5 + 560 + 70) / 3</tt>. The actual median of these values combined, is 70ms.</p>
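
<p>A small sketch makes the pitfall easy to reproduce. The per-minute values below are illustrative, chosen so that each minute matches the figure (nine values with a median of 5ms, then five with a median of 560ms, then five with a median of 70ms); they are not real measurements:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">import statistics

minute_1 = [5, 5, 5, 5, 5, 70, 70, 70, 560]  # median 5 ms
minute_2 = [4, 9, 560, 700, 980]             # median 560 ms
minute_3 = [6, 10, 70, 150, 220]             # median 70 ms

per_minute_medians = [statistics.median(m) for m in (minute_1, minute_2, minute_3)]
average_of_medians = statistics.mean(per_minute_medians)         # about 211 ms
true_median = statistics.median(minute_1 + minute_2 + minute_3)  # 70 ms

print(average_of_medians, true_median)</pre></div>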

<h3 class="remarkup-header">Buckets</h3>

<p>To compute the percentile over a large period, we must have all original values. But, it’s not efficient to store data about every visit to Wikipedia for a long time. We could not quickly compute percentiles either.</p>

<p>A different way of summarising data is by using buckets. We can create one bucket for each range of values. Then, when we process a time value, we only increment the counter for that bucket. When using a bucket in this way, it is also called a <em>histogram bin</em>.</p>

<p>Let’s process the same example values as before, but this time using buckets.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/4iofmmgk2rmufsk4fwfk/PHID-FILE-5qjog7zsjygzk7csj3tk/figure-3-buckets.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_85"><img src="https://phab.wmfusercontent.org/file/data/4iofmmgk2rmufsk4fwfk/PHID-FILE-5qjog7zsjygzk7csj3tk/figure-3-buckets.png" height="463" alt="There are four buckets. Bucket A is for values below 11ms. Bucket B is for 11ms to 100ms. Bucket C is for 101ms to 1000ms. And Bucket D is for values above 1000ms. For each of the 19 values, we find the associated bucket and increase its counter." /></a></div><br />
<div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/czeyvgnensjxvwgrj5pq/PHID-FILE-pz4dhhb6wn67qqb67jsa/figure-4-buckets-summary.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_86"><img src="https://phab.wmfusercontent.org/file/data/czeyvgnensjxvwgrj5pq/PHID-FILE-pz4dhhb6wn67qqb67jsa/figure-4-buckets-summary.png" height="90" alt="After processing all values, the counters are as follows. Bucket A holds 9, Bucket B holds 4, Bucket C holds 6, and Bucket D holds 0." /></a></div></p>

<p>Based on the total count (19) we know that the median (10th value) must be in bucket B, because bucket B contains values 10 to 13. And that the 75th percentile (15th value) must be in bucket C because it contains values 14 to 19.</p>

<p>We cannot know the exact millisecond value of the median, but we know the median must be between 11ms and 100ms. (This matches our previous calculation, which produced 70ms.)</p>

<p>When we use exact percentiles, our goal is for that percentile to be a certain number. For example, if our 75th percentile today is 560ms, this means that for 75% of users a response takes 560ms or less. Our goal could be to reduce the 75th percentile to below 500ms.</p>

<p>When using buckets, goals are defined differently. In our example, 6 out of 19 responses (32%) are above 100ms (bucket C and D), and 13 of 19 (68%) are below 100ms (bucket A and B). Our goal could be to reduce the percentage of responses above 100ms. Or the opposite, to increase the percentage of responses within 100ms.</p>
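
<p>Here is a compact sketch of the same idea (illustrative, not our actual instrumentation): count values into bins, then locate the bin that contains a given percentile rank. The 19 values are the same illustrative ones used above:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code">import bisect

values = [5, 5, 5, 5, 5, 70, 70, 70, 560, 4, 9, 560, 700, 980, 6, 10, 70, 150, 220]

# Bucket boundaries in milliseconds: A is 10ms and below, B is 11-100ms,
# C is 101-1000ms, and D is everything above 1000ms.
boundaries = [10, 100, 1000]
labels = ["A", "B", "C", "D"]
counts = dict.fromkeys(labels, 0)
for v in values:
    counts[labels[bisect.bisect_left(boundaries, v)]] += 1
print(counts)  # A: 9, B: 4, C: 6, D: 0

def bucket_for_percentile(p):
    # Nearest-rank: which bucket holds the value at the p-th percentile?
    rank = -(-p * len(values) // 100)  # ceiling of p% of 19
    running = 0
    for label in labels:
        running += counts[label]
        if running &gt;= rank:
            return label

print(bucket_for_percentile(50))  # B: the median is between 11ms and 100ms
print(bucket_for_percentile(75))  # C: the 75th percentile is between 101ms and 1000ms</pre></div>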

<h3 class="remarkup-header">Rise of mobile</h3>

<p>Traffic trends are generally moving towards mobile. In fact, April 2017 was the first month where Wikimedia mobile pageviews reached 50% of all Wikimedia pageviews. And after June 2017, mobile traffic has stayed above 50%.</p>

<p><div class="phabricator-remarkup-embed-layout-center"><a href="https://phab.wmfusercontent.org/file/data/2tpgogf2g34qggfx6rag/PHID-FILE-7scdbt4qdznn4noohg5q/blog-image-platform-percent.png" class="phabricator-remarkup-embed-image" data-sigil="lightboxable" data-meta="0_87"><img src="https://phab.wmfusercontent.org/file/data/2tpgogf2g34qggfx6rag/PHID-FILE-7scdbt4qdznn4noohg5q/blog-image-platform-percent.png" height="159" alt="Bar chart showing percentages of mobile and desktop pageviews for each month in 2017. They mostly swing equal at around 50%. Looking closely, we see mobile first reaches 51% in April. In May it was below 50% again. But for June and every month since then mobile has remained above 50%. The peak was in October 2017, where mobile accounted for 59% of pageviews. The last month in the graph, November 2017 shows 53% of mobile pageviews." /></a></div></p>

<p>Global changes like this have a big impact on our measurements. This is the kind of change that drives us to rethink how we measure performance, and (more importantly) what we monitor.</p>

<h3 class="remarkup-header">Further reading</h3>

<ul class="remarkup-list">
<li class="remarkup-list-item"><a href="https://www.mediawiki.org/wiki/Wikimedia_Performance_Team" class="remarkup-link remarkup-link-ext" rel="noreferrer">Wikimedia Performance Team</a> – overview of our projects, tools, and data.</li>
<li class="remarkup-list-item"><a href="https://www.w3.org/TR/navigation-timing-2/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Navigation Timing Level 2</a>, specification at W3C.</li>
<li class="remarkup-list-item"><a href="https://www.infoq.com/presentations/latency-response-time" class="remarkup-link remarkup-link-ext" rel="noreferrer">&quot;How Not To Measure Latency&quot;</a>, a tech talk by Gil Tene.</li>
<li class="remarkup-list-item"><a href="https://howdns.works/" class="remarkup-link remarkup-link-ext" rel="noreferrer">How DNS Works</a>, a comic explaining how computers use domain names.</li>
<li class="remarkup-list-item"><a href="https://en.wikipedia.org/wiki/Domain_Name_System" class="remarkup-link remarkup-link-ext" rel="noreferrer">&quot;Domain Name System (DNS)&quot;</a>, at Wikipedia.</li>
<li class="remarkup-list-item"><a href="https://en.wikipedia.org/wiki/Transmission_Control_Protocol" class="remarkup-link remarkup-link-ext" rel="noreferrer">&quot;Transmission Control Protocol (TCP)&quot;</a>, at Wikipedia.</li>
<li class="remarkup-list-item"><a href="https://en.wikipedia.org/wiki/HTTPS" class="remarkup-link remarkup-link-ext" rel="noreferrer">&quot;HTTPS&quot;</a>, at Wikipedia.</li>
</ul></div></content></entry><entry><title>The journey to Thumbor, part 3: development and deployment strategy</title><link href="/phame/live/7/post/81/the_journey_to_thumbor_part_3_development_and_deployment_strategy/" /><id>https://phabricator.wikimedia.org/phame/post/view/81/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2017-11-20T12:42:50+00:00</published><updated>2017-12-23T02:13:07+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>In the last blog post I described <a href="https://phabricator.wikimedia.org/phame/post/view/80/the_journey_to_thumbor_part_2_thumbnailing_architecture/" class="remarkup-link" rel="noreferrer">where Thumbor fits in our media thumbnailing stack</a>. Introducing Thumbor replaces an existing service, and as such it&#039;s important that it doesn&#039;t perform worse than its predecessor. We came up with a strategy to reach feature parity and ensure a launch that would be invisible to end users.</p>

<h2 class="remarkup-header">Development</h2>

<p>In Wikimedia production, Thumbor was due to interact with several services: Varnish, Swift, Nginx, Memcached, Poolcounter. In order to iron out those interactions, it was important to reproduce them locally during development. Which is why I wrote several roles for the official <a href="https://www.mediawiki.org/wiki/MediaWiki-Vagrant" class="remarkup-link remarkup-link-ext" rel="noreferrer">MediaWiki Vagrant</a> machine, with help from <a href="https://phabricator.wikimedia.org/p/bd808/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_88"><span class="phui-tag-core phui-tag-color-person"><span class="phui-tag-dot phui-tag-color-orange"></span>@bd808</span></a>. Those have already been useful to other developers, with several people reaching out to me about the Varnish and Swift Vagrant roles. While at the time it might have seemed like an unnecessary quest (why not develop straight on a production machine?) it was actually a great learning experience to write the extensive Puppet code required to make it work. While it&#039;s a separate codebase, subsequent work to port that over to production Puppet was minimal.</p>

<p>This phase actually represented the bulk of the work, reproducing support for all the media formats and special parameters found in MediaWiki thumbnailing. I dedicated a lot of attention to making sure that the images generated by Thumbor were as good as what MediaWiki was outputting for the same original media. In order to do that, I wrote many integration tests using thumbnails from Wikimedia production, which were used as reference output. Those tests are still part of the Thumbor plugins Debian package and ensure that we avoid regressions. They use a DSSIM algorithm to visually compare images and make sure that what Thumbor outputs doesn&#039;t visually diverge from the reference thumbnails. We also compare file size to make sure that the new output isn&#039;t significantly heavier than the old.</p>
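
<p>For illustration, here is roughly what such a comparison can look like with off-the-shelf tools (a sketch using Pillow and scikit-image, not the actual test suite; the thresholds are made up and both images are assumed to have the same dimensions):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Compare a candidate thumbnail against a reference one: a structural
# similarity check plus a file size check.
import os

import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity

def assert_similar(reference_path, candidate_path, max_dssim=0.01, max_size_ratio=1.1):
    ref = np.asarray(Image.open(reference_path).convert("L"))
    cand = np.asarray(Image.open(candidate_path).convert("L"))
    ssim = structural_similarity(ref, cand, data_range=255)
    dssim = (1 - ssim) / 2  # 0 means identical, higher means more divergence
    assert dssim &lt;= max_dssim, f"thumbnails diverge visually (DSSIM={dssim:.4f})"

    ratio = os.path.getsize(candidate_path) / os.path.getsize(reference_path)
    assert ratio &lt;= max_size_ratio, f"candidate is {ratio:.2f}x the reference size"</pre></div>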

<h2 class="remarkup-header">Packaging</h2>

<p>The next big phase of the project was to create a Debian package for our Thumbor code. I had never done that before and it wasn&#039;t as difficult as some people make it out to be (I imagine the tooling has gotten significantly better than it used to be), at least for Python packages. However, in order to be able to ship our code as a Debian package, Thumbor itself needed to have a Debian package. Which wasn&#039;t the case at the time. Some people had tried on much older versions of Thumbor but never reached the point where it was put in Debian proper. Since that last attempt, Thumbor added a lot of new dependencies that weren&#039;t packaged either. <a href="https://phabricator.wikimedia.org/p/fgiunchedi/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_89"><span class="phui-tag-core phui-tag-color-person">@fgiunchedi</span></a> and I worked on packaging it all and successfully did so. And with the help of Debian developer Marcelo Jorge Vieira who pushed most of those packages for us into Debian, we crossed the finish line recently and got <a href="https://ftp-master.debian.org/new/thumbor_6.3.2-1.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">Thumbor submitted to Debian unstable</a>.</p>

<p>One advantage of doing this is that it makes deployment of updates really straightforward, with the integration test suite I mentioned earlier running in isolation when the Debian package is built. With those Debian packages done, we were ready to run this on production machines.</p>

<p>But the more important advantage is that by having those Debian packages into Debian itself, other people are using the exact same versions of Thumbor&#039;s dependencies and Thumbor itself via Debian, thus greatly expanding the exposure of the software we run in production. This increases the likelihood that security issues we might be exposed to are found and fixed.</p>

<h2 class="remarkup-header">Beta</h2>

<p>Trying to reproduce the production setup locally is always limited. The full complexity of production configuration isn&#039;t there, and everything is still running on the same machine. The next step was to convert the Vagrant Puppet code into production Puppet code. Which allowed us to run this on <a href="https://www.mediawiki.org/wiki/Beta_Cluster" class="remarkup-link remarkup-link-ext" rel="noreferrer">the Beta cluster</a> as a first step, where we could reproduce a setup closer to production with several machines. This was actually an opportunity to improve the Beta cluster to make it have a proper Varnish and Swift setup closer to production than it used to have. Just like the Vagrant improvements, those changes quickly paid off by being useful to others who were working on Beta.</p>

<p>Just like packaging, this new step revealed bugs in the Thumbor plugins Python code that we were able to fix before hitting production.</p>

<h2 class="remarkup-header">Pre-production</h2>

<p>The Beta wikis only have a small selection of media, and as such we still hadn&#039;t been exposed to the variety of content found on production wikis. I was worried that we would run into media files in production with special properties that we hadn&#039;t encountered during the development phase. Which is why I came up with a plan to dual-serve all production requests to the new production Thumbor machines and compare output.</p>

<p>This consisted of modifications to the production Swift proxy plugin code we have in place to rewrite Wikimedia URLs. Instead of sending thumbnail requests to just MediaWiki, I modified it to also send the same requests to Thumbor. At first this was completely blind: the Swift proxy would send requests to Thumbor and not even wait to see the outcome.</p>

<p>Then I looked at the Thumbor error logs and found several files that were problematic for Thumbor and not for MediaWiki. This allowed us to fix many bugs that we would have normally found out about during the actual launch. This was also the opportunity to reproduce and iron out the various throttling mechanisms.</p>

<p>To be more thorough, I made the Swift proxy log the HTTP status codes returned by MediaWiki and Thumbor and produced a diff, looking for files that were problematic for one and not the other. This allowed us to find more bugs on the Thumbor side, and a few instances of files that Thumbor could render properly that MediaWiki couldn&#039;t!</p>
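
<p>Conceptually, that comparison boils down to something like the following sketch (illustrative only; the real logging and diffing happened in the Swift proxy and our log pipeline):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Diff the HTTP status codes recorded for the same thumbnail URL by the two
# backends, and list the URLs where they disagree. Each log is assumed to be
# an iterable of (url, status) pairs.
def diff_status_codes(mediawiki_log, thumbor_log):
    mw = dict(mediawiki_log)
    th = dict(thumbor_log)
    mismatches = []
    for url, mw_status in mw.items():
        th_status = th.get(url)
        if th_status is not None and th_status != mw_status:
            mismatches.append((url, mw_status, th_status))
    return mismatches

# Hypothetical example: a file that one backend renders and the other rejects.
print(diff_status_codes(
    [("/thumb/example.pdf/page2-200px-example.pdf.jpg", 200)],
    [("/thumb/example.pdf/page2-200px-example.pdf.jpg", 500)],
))</pre></div>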

<p>This is also the phase where under the full production load, our Thumbor configuration started showing significant issues around memory consumption and leaks. We were able to fix all those problems in that fire-and-forget dual serving setup, with no impact at all on production traffic. This was an extremely valuable strategy, as we were able to iterate quickly in the same traffic conditions as if the service had actually launched, without any consequences for users.</p>

<h2 class="remarkup-header">Production</h2>

<p>With Thumbor running smoothly on production machines, successfully rendering a superset of the thumbnails MediaWiki was able to render, it was time to launch. The dual-serving logic in the Swift proxy came in very handy: it became a simple toggle between sending thumbnailing traffic to MediaWiki and sending it to Thumbor. And so we did switch. We did that gradually, having more and more wikis&#039; thumbnails rendered by Thumbor over the course of a couple of weeks. The load was handled fine (predictable, since we were handling the same load in the dual-serving mode). The success rate of requests based on HTTP status codes was the same before and after.</p>

<p>However after some time we started getting reports of issues around EXIF orientation. A feature we had integration tests for. But the tests only covered 180 degrees rotation and not 90 degrees (doh!). The Swift proxy switch allowed us to quickly switch traffic back to MediaWiki. We did so because it&#039;s quite a prevalent feature in JPGs. We fixed that one large bug, switched the traffic back to Thumbor and that was it.</p>

<p>Some minor bugs surfaced later regarding much less common files with special properties, which we were able to fix very quickly and deploy safely and easily with the Debian package. But we could have avoided all of those bugs too if we had been more thorough in the dual-serving phase. We were only comparing HTTP status codes between MediaWiki and Thumbor. However, rendering a thumbnail successfully doesn&#039;t mean that the visual contents are right! The JPG orientation could be wrong, for example. If I had to do it again, I would have run DSSIM visual comparisons on the live dual-served production traffic between the MediaWiki and Thumbor outputs. That would have definitely surfaced the handful of bugs that appeared post-launch.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>All in all, if you do your homework and are very thorough in testing locally and on production traffic, you can achieve a very smooth launch replacing a core part of infrastructure with completely different software. Despite the handful of avoidable bugs that appeared around the launch, the switch to Thumbor went largely unnoticed by users, which was the original intent, as we were looking for feature parity and ease of swapping the new solution in. Thumbor has been happily serving all Wikimedia production thumbnail traffic since June 2017 in a very stable fashion. This concludes our journey to Thumbor :)</p></div></content></entry><entry><title>The journey to Thumbor, part 2: thumbnailing architecture</title><link href="/phame/live/7/post/80/the_journey_to_thumbor_part_2_thumbnailing_architecture/" /><id>https://phabricator.wikimedia.org/phame/post/view/80/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2017-11-17T15:17:26+00:00</published><updated>2017-12-13T16:14:29+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p><a href="https://wikitech.wikimedia.org/wiki/Thumbor" class="remarkup-link remarkup-link-ext" rel="noreferrer">Thumbor</a> has now been serving all public thumbnail traffic for Wikimedia production since late June 2017.</p>

<p>In <a href="https://phabricator.wikimedia.org/phame/post/view/55/the_journey_to_thumbor_part_1_rationale/" class="remarkup-link" rel="noreferrer">a previous blog post</a> I explained the rationale behind that project. To understand why Thumbor is a good fit, it&#039;s important to understand where it fits in our overall thumbnailing architecture. A lot of historic constraints come into play, and Thumbor could be adapted to meet those needs.</p>

<h2 class="remarkup-header">The stack</h2>

<p>Like everything we serve to readers, thumbnails are heavily cached. Unlike wiki pages, there is no distinction in caching of thumbnails between readers and editors, in fact. Our edge is Nginx providing SSL termination, behind which we find Varnish clusters (both frontends and backend), which talk to OpenStack Swift - responsible for storing media originals as well as thumbnails - and finally Swift talks to Thumbor (previously MediaWiki).</p>

<h2 class="remarkup-header">The request lifecycle</h2>

<p>Nginx concerns itself with SSL and HTTP/2, because the Varnish project decided to draw a line around its concerns and exclude HTTP/2 support.</p>

<p><a href="https://wikitech.wikimedia.org/wiki/Varnish" class="remarkup-link remarkup-link-ext" rel="noreferrer">Varnish</a> concerns itself with having a very high cache hit rate for existing thumbnails. When a thumbnail isn&#039;t found in Varnish, either it has never been requested before, or it fell out of cache for not being requested frequently enough.</p>

<p><a href="https://wikitech.wikimedia.org/wiki/Swift" class="remarkup-link remarkup-link-ext" rel="noreferrer">Swift</a> concerns itself with long-term storage. We have a historical policy - which is in the process of being reassessed - of storing all thumbnails long-term. Which means that when a thumbnail isn&#039;t in Varnish, there&#039;s a high likelihood that it&#039;s found in Swift. Which is why Swift is first in line behind Varnish. When it receives a request for a missing thumbnail from Varnish, the Swift proxy first checks if Swift has a copy of that thumbnail. If not, it forwards that request to Thumbor.</p>

<p>Thumbor concerns itself with generating thumbnails from original media. When it receives a request from Swift, it requests the corresponding original media from Swift, generates the required thumbnail from that original and returns it. This response is sent back up the call chain, all the way to the client, through Swift and Varnish. After that response is sent, Thumbor saves that thumbnail in Swift. Varnish, as it sees the response go through, keeps a copy as well.</p>
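
<p>In pseudocode, the flow described above looks roughly like this (an illustrative sketch with hypothetical helper objects, not the actual Swift middleware or Thumbor code):</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Rough shape of the thumbnail request flow behind Varnish.
def swift_proxy(url, swift_store, thumbor):
    # Swift checks long-term storage first; most cache misses end here.
    thumbnail = swift_store.get_thumbnail(url)
    if thumbnail is None:
        # Not stored yet: forward the request to Thumbor.
        thumbnail = thumbor.render(url)
    return thumbnail  # travels back up through Swift and Varnish to the client

def thumbor_render(url, swift_store, engine):
    original = swift_store.get_original(url)     # fetch the original media
    thumbnail = engine.thumbnail(original, url)  # generate the requested size
    # In production, Thumbor saves the thumbnail to Swift after the response
    # has been sent; it is shown inline here for simplicity.
    swift_store.save_thumbnail(url, thumbnail)
    return thumbnail</pre></div>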

<h2 class="remarkup-header">What&#039;s out of scope</h2>

<p>Noticeably absent from the above is uploading, extracting metadata from the original media, etc. All of which are still MediaWiki concerns at upload time. Thumbor doesn&#039;t try to handle all things media, it is solely a thumbnailing engine. The concern of uploading, parsing and storing the original media is separate. In fact, Thumbor goes as far as trying to fetch as little data about the original from Swift as possible, seeking data transfer efficiency. For example, we have a custom loader for videos that leverages <a href="https://www.ffmpeg.org/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Ffmpeg&#039;s</a> support for <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests" class="remarkup-link remarkup-link-ext" rel="noreferrer">range requests</a>, only fetching the frames it needs over the network, rather than the whole video.</p>
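
<p>The underlying mechanism is a plain HTTP Range request. As a generic illustration (not our actual Thumbor loader, and with a placeholder URL), fetching only the first megabyte of a remote file looks like this:</p>

<div class="remarkup-code-block" data-code-lang="python" data-sigil="remarkup-code-block"><pre class="remarkup-code"># Fetch only the first 1 MiB of a large remote file with an HTTP Range request.
import requests

url = "https://example.org/some-large-video.webm"  # placeholder URL
response = requests.get(url, headers={"Range": "bytes=0-1048575"}, timeout=30)

# A server that honours Range requests answers with 206 Partial Content.
print(response.status_code, len(response.content))</pre></div>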

<h2 class="remarkup-header">What we needed to add</h2>

<p>We wanted a thumbnailing service that was &quot;dumb&quot;, i.e. didn&#039;t concern itself with more than thumbnailing. Thumbor definitely provided that, but was too simple for our existing needs, which is why we had to write a number of plugins for it, to add the following features:</p>

<ul class="remarkup-list">
<li class="remarkup-list-item">New media formats (XCF, DJVU, PDF, WEBM, etc.)</li>
<li class="remarkup-list-item">Smarter handling of giant originals (&gt;1GB) to save memory</li>
<li class="remarkup-list-item">The ability to run multiple format engines at once</li>
<li class="remarkup-list-item">Support for multipage media</li>
<li class="remarkup-list-item">Handling the Wikimedia thumbnail URL format</li>
<li class="remarkup-list-item">Loading originals from Swift</li>
<li class="remarkup-list-item">Loading videos efficiently with range requests</li>
<li class="remarkup-list-item">Saving thumbnails in Swift</li>
<li class="remarkup-list-item">Various forms of throttling</li>
<li class="remarkup-list-item">Live production debugging with <a href="https://github.com/ionelmc/python-manhole" class="remarkup-link remarkup-link-ext" rel="noreferrer">Manhole</a></li>
<li class="remarkup-list-item">Sending logs to ELK</li>
<li class="remarkup-list-item">Wikimedia-specific filters/settings, such as conditional sharpening of JPGs</li>
</ul>

<p>We also changed the images included in the Thumbor project to be respectful of open licenses and wrote Debian packages for all of Thumbor&#039;s dependencies and Thumbor itself.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>While Thumbor was a good match on the separation of concerns we were looking for, it still required writing many plugins and a lot of extra work to make it a drop-in replacement for MediaWiki&#039;s media thumbnailing code. The main reason being that Wikimedia sites support types of media files that the web at large cares less about, like giant TIFFs and PDFs.</p>

<p>In the next blog post, I&#039;ll describe the development strategy that led to the successful deployment of Thumbor in production.</p></div></content></entry><entry><title>The journey to Thumbor, part 1: rationale</title><link href="/phame/live/7/post/55/the_journey_to_thumbor_part_1_rationale/" /><id>https://phabricator.wikimedia.org/phame/post/view/55/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2017-06-20T15:33:17+00:00</published><updated>2017-12-13T16:14:49+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>We are currently in the final stages of deploying <a href="https://github.com/thumbor/thumbor" class="remarkup-link remarkup-link-ext" rel="noreferrer">Thumbor</a> to Wikimedia production, where it will generate media thumbnails for all our public wikis. Up until now, MediaWiki was responsible for generating thumbnails.</p>

<p>I started the project of making Thumbor production-ready for Wikimedia <a href="https://phabricator.wikimedia.org/T121388" class="remarkup-link" rel="noreferrer">a year and a half ago</a> and I&#039;ll talk about this journey in a series of blog posts. In this one, I&#039;ll explain the rationale behind this project.</p>

<h2 class="remarkup-header">Security</h2>

<p>The biggest reason to change the status quo is security. Since MediaWiki is quite monolithic, deployments of MediaWiki on our server fleet responsible for generating thumbnails aren&#039;t as isolated as they could be from the rest of our infrastructure.</p>

<p>Media formats being <a href="https://www.cvedetails.com/vulnerability-list/vendor_id-7294/Libpng.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">a frequent security breach vector</a>, it has always been an objective of ours to isolate thumbnailing more than we currently can with MediaWiki. We run our command-line tools responsible for media conversion inside <a href="https://github.com/netblue30/firejail" class="remarkup-link remarkup-link-ext" rel="noreferrer">firejail</a>, but we could do more to fence off thumbnailing from the rest of what we do.</p>

<p>One possibility would have been to rewrite the MediaWiki code responsible for thumbnailing, turning it into a series of PHP libraries that could then be run without MediaWiki to perform the thumbnailing work we are currently doing - while untangling the code enough that the thumbnailing servers can be more isolated.</p>

<p>However such a rewrite would be very expensive and when we can afford to, we prefer to use ready-made open source solutions with a community of their own, rather than writing new tools. It seemed to us that media thumbnailing was far from being a MediaWiki-specific problem and there ought to be open source solutions tackling that issue. We undertook a review of the open source landscape for this problem domain and Thumbor emerged as the clear leader in that area.</p>

<h2 class="remarkup-header">Maintenance</h2>

<p>The MediaWiki code responsible for thumbnailing currently doesn&#039;t have any team ownership at the Wikimedia Foundation. It&#039;s maintained by volunteers (including some WMF staff acting in a volunteer capacity). However, the number of contributors is very low and technical debt is accumulating.</p>

<p>Thumbor, on the other hand, is <a href="https://github.com/thumbor/thumbor/graphs/contributors" class="remarkup-link remarkup-link-ext" rel="noreferrer">a very active open-source project with many contributors</a>. A large company, <a href="https://en.wikipedia.org/wiki/Grupo_Globo" class="remarkup-link remarkup-link-ext" rel="noreferrer">Globo</a>, where this project originated, dedicates significant resources to it.</p>

<p>In the open source world, joining forces with others pays off, and Thumbor is the perfect example of this. Like other large websites leveraging Thumbor, we&#039;ve contributed <a href="https://github.com/thumbor/thumbor/commits?author=gi11es" class="remarkup-link remarkup-link-ext" rel="noreferrer">a number of upstream changes</a>.</p>

<p>Maintenance of <a href="https://phabricator.wikimedia.org/diffusion/THMBREXT/" class="remarkup-link" rel="noreferrer">Wikimedia-specific Thumbor plugins</a> remains, but those represent only a small portion of the code, the lion&#039;s share of the functionality being provided by Thumbor.</p>

<h2 class="remarkup-header">Service-oriented architecture</h2>

<p>For operational purposes, running parts of the wiki workflow as isolated services is always beneficial. It enables us to set up the best fencing possible for security purposes, where Thumbor only has access to what it needs. This limits the amount of damage possible in case of a security vulnerability propagated through media files.</p>

<p>From monitoring, to resource usage control and upstream security updates, running our media thumbnailing as a service has significant operational upsides.</p>

<h2 class="remarkup-header">New features</h2>

<p>3rd-party open source projects might have features that would have been low priority on our list to implement, or considered too costly to build. Thumbor sports a number of features that MediaWiki currently doesn&#039;t have, which might open exciting possibilities in the future, such as <a href="http://thumbor.readthedocs.io/en/latest/detection_algorithms.html#facial-detection" class="remarkup-link remarkup-link-ext" rel="noreferrer">feature detection</a> and advanced <a href="http://thumbor.readthedocs.io/en/latest/filters.html" class="remarkup-link remarkup-link-ext" rel="noreferrer">filters</a>.</p>

<p>At this time, however, we&#039;re only aiming to deploy Thumbor to Wikimedia production as a drop-in replacement for MediaWiki thumbnailing, targeting feature parity with the status quo.</p>

<h2 class="remarkup-header">Performance</h2>

<p>Where does performance fit in all this? For one, Thumbor&#039;s clean extension architecture means that the <a href="https://phabricator.wikimedia.org/diffusion/THMBREXT/" class="remarkup-link" rel="noreferrer">Wikimedia-specific code footprint is small</a>, making improvements to our thumbnailing pipeline a lot easier.  Running thumbnailing as a service means that it should be more practical to test alternative thumbnailing software and parameters.</p>

<p>Rendering thumbnails as <a href="https://en.wikipedia.org/wiki/WebP" class="remarkup-link remarkup-link-ext" rel="noreferrer">WebP</a> to user agents that support it is a <a href="http://thumbor.readthedocs.io/en/latest/format.html?highlight=webp" class="remarkup-link remarkup-link-ext" rel="noreferrer">built-in feature of Thumbor</a> and the most likely first performance project we&#039;ll leverage Thumbor for, once Thumbor has proven to handle our production load correctly for some time. This alone should save a significant amount of bandwidth for users whose user agents support WebP. This is the sort of high-impact performance change to our images that Thumbor will make a lot easier to achieve.</p>

<h2 class="remarkup-header">Conclusion</h2>

<p>Those many factors contributed to us betting on Thumbor. Soon it will be put to the test of Wikimedia production where not only the scale of our traffic but also the huge diversity of media files we host make thumbnailing a challenge.</p>

<p>In the next blog post, I&#039;ll describe the architecture of our production thumbnailing pipeline in detail and where Thumbor fits into it.</p></div></content></entry><entry><title>Looking back: improvements to edit save time</title><link href="/phame/live/7/post/54/looking_back_improvements_to_edit_save_time/" /><id>https://phabricator.wikimedia.org/phame/post/view/54/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2017-06-12T09:32:27+00:00</published><updated>2017-06-14T01:56:51+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>The WMF&#039;s financial year and its <a href="https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2016-2017/Final" class="remarkup-link remarkup-link-ext" rel="noreferrer">annual plan</a> are coming to an end, and one of the Performance team&#039;s <a href="https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2016-2017/Final#Program_4:_Improve_site_performance" class="remarkup-link remarkup-link-ext" rel="noreferrer">goals this past year</a> was to reduce the amount of time it takes to save an edit on a wiki.</p>

<p>This set of metrics, which we call Save Timing, is <a href="https://grafana.wikimedia.org/dashboard/db/save-timing" class="remarkup-link remarkup-link-ext" rel="noreferrer">publicly tracked on Grafana</a>. It&#039;s recorded for all Wikimedia wikis. It&#039;s a critical performance pain point for editors, as edits on large wiki pages can sometimes take seconds to save.</p>

<p>We distinguish the amount of time the backend takes to process the edit from the amount of time the end user actually experiences when saving the edit (collected client-side). We&#039;ll focus on the latter, as this is what people really experience. Backend traffic can come from bots, jobs, etc., where long execution times atypical of human edits affect the metrics.</p>

<p>Let&#039;s look at the evolution of frontend save timing since the beginning of the financial year, on July 1st 2016.</p>

<p>The 99th percentile, which represents the slowest editors&#039; experience, dropped significantly:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/xcsqjnaiujgegagkowna/PHID-FILE-7p54mtjy4bmf5vid2cfv/Capture_d_e%CC%81cran_2017-06-08_11.13.49.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_90"><img src="https://phab.wmfusercontent.org/file/data/xcsqjnaiujgegagkowna/PHID-FILE-7p54mtjy4bmf5vid2cfv/Capture_d_e%CC%81cran_2017-06-08_11.13.49.png" height="573" width="1135" loading="lazy" alt="Capture d&#039;écran 2017-06-08 11.13.49.png (573×1 px, 57 KB)" /></a></div></p>

<p>Going from 22.4 to 16.82 seconds (weekly average), a <strong>25% improvement</strong>.</p>

<p>So did the median:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/ihjmbhik5oddayd2rliy/PHID-FILE-c5d2yobmaymeebks2osu/Capture_d_e%CC%81cran_2017-06-08_11.13.38.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_91"><img src="https://phab.wmfusercontent.org/file/data/ihjmbhik5oddayd2rliy/PHID-FILE-c5d2yobmaymeebks2osu/Capture_d_e%CC%81cran_2017-06-08_11.13.38.png" height="576" width="1132" loading="lazy" alt="Capture d&#039;écran 2017-06-08 11.13.38.png (576×1 px, 62 KB)" /></a></div></p>

<p>Going from 953 to 813 milliseconds (weekly average), a <strong>15% improvement</strong>.</p>

<p><a href="https://phabricator.wikimedia.org/p/aaron/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_92"><span class="phui-tag-core phui-tag-color-person">@aaron</span></a> deserves most of the credit for this tremendous performance improvement that editors experience every day. Performance is a never-ending goal and we hope to achieve even better save timing in the future thanks to our continued work in this area.</p></div></content></entry><entry><title>Improving time-to-logo performance with preload links</title><link href="/phame/live/7/post/19/improving_time-to-logo_performance_with_preload_links/" /><id>https://phabricator.wikimedia.org/phame/post/view/19/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2017-06-07T07:38:43+00:00</published><updated>2017-06-08T11:07:20+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>One of the goals of the  Wikimedia Performance Team is to improve the performance of MediaWiki and the broader software stack used on Wikimedia wikis. In this article we’ll describe a small performance improvement we’ve implemented for MediaWiki and recently deployed to production for Wikimedia. It highlights some of the unique problems we encounter on Wikimedia sites and how new web standards can be leveraged to improve performance.</p>

<h2 class="remarkup-header">Logo as CSS background</h2>

<p>The <a href="https://www.mediawiki.org/wiki/Manual:$wgLogo" class="remarkup-link remarkup-link-ext" rel="noreferrer">MediaWiki logo</a> is defined as a <a href="https://developer.mozilla.org/en/docs/Web/CSS/background-image?v=control" class="remarkup-link remarkup-link-ext" rel="noreferrer">CSS background image</a> on an element. This is historically for caching reasons, because MediaWiki deployments tend to cache pages as a whole and changing the logo would thus require invalidating all pages if the logo was a regular &lt;img&gt; tag. By having it as a CSS background, updating the logo only requires invalidating the stylesheet where it resides. This constraint has significant implications on when the logo loads.</p>

<p>In the loading sequence of a web page, browsers will give a relatively low priority to CSS background images. In practice, assuming an empty browser cache, this means that the MediaWiki logo loads quite late, after most images that are part of the page content have been loaded. To the viewer, this results in the page loading somewhat out of order: images that aren’t necessarily in view are loaded first, and the logo is one of the last images to be loaded. This breaks the de facto expectation that a web page’s content loads from top to bottom.</p>

<p>This phenomenon extends the average duration of an imaginary metric one could call time-to-logo. The point in time when the logo appears is an important mental milestone, as it’s when a visitor has visual confirmation that they’ve landed on the right website. The high time-to-logo caused by the CSS background limitation is felt even more on slow internet connections, where the logo can take seconds to appear, long after the page’s text and the images further down the page have loaded.</p>

<h2 class="remarkup-header">The preload link</h2>

<p>We have been looking for a solution to this problem for some time, and a relatively new browser feature has enabled us to develop a workaround. The <a href="https://w3c.github.io/preload/" class="remarkup-link remarkup-link-ext" rel="noreferrer">preload link keyword</a>, developed by the W3C, allows us to inform the browser early that the logo will be needed at some point on the page. This feature can be combined with <a href="https://developer.mozilla.org/en-US/docs/Web/CSS/Media_Queries/Using_media_queries" class="remarkup-link remarkup-link-ext" rel="noreferrer">CSS media queries</a>, which in our case means that the browser will only preload the version of the logo that matches the current pixel density/zoom. This is essential, as we don’t want to preload a version of the logo that the page won’t need. The browser cache is also respected, meaning that all we’re doing is loading the logo a lot earlier than it naturally would be, which is exactly what we were looking for. In fact, the browser now knows that it needs the logo even sooner than it would have if we had displayed the logo as an &lt;img&gt; element without preload.</p>
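
<p>As an illustrative sketch, the same density-aware hints expressed as &lt;link&gt; elements in the page’s &lt;head&gt; would look roughly like this (the paths and media queries mirror the English Wikipedia header shown further below):</p>

<div class="remarkup-code-block" data-code-lang="html" data-sigil="remarkup-code-block"><pre class="remarkup-code">&lt;!-- Sketch of density-aware preload hints as markup; only the element
     whose media query matches the device is fetched early. --&gt;
&lt;link rel=&quot;preload&quot; as=&quot;image&quot; href=&quot;/static/images/project-logos/enwiki.png&quot;
      media=&quot;not all and (min-resolution:1.5dppx)&quot;&gt;
&lt;link rel=&quot;preload&quot; as=&quot;image&quot; href=&quot;/static/images/project-logos/enwiki-1.5x.png&quot;
      media=&quot;(min-resolution:1.5dppx) and (max-resolution:1.999999dppx)&quot;&gt;
&lt;link rel=&quot;preload&quot; as=&quot;image&quot; href=&quot;/static/images/project-logos/enwiki-2x.png&quot;
      media=&quot;(min-resolution:2dppx)&quot;&gt;</pre></div>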

<p>The preload links for the site logo have been deployed to production for all Wikimedia wikis. They can easily be spotted in the response headers of pages that display the logo (the vast majority, if not all, of pages on wikis for desktop users). This is actually leveraging a little-known browser feature where <a href="https://www.w3.org/wiki/LinkHeader" class="remarkup-link remarkup-link-ext" rel="noreferrer">&lt;link&gt; tags can be passed as response headers</a>, which in this situation allows us to inform the browser even sooner that the logo will be needed.</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">Link: &lt;/static/images/project-logos/enwiki.png&gt;;rel=preload;as=image;media=not all and (min-resolution:1.5dppx),&lt;/static/images/project-logos/enwiki-1.5x.png&gt;;rel=preload;as=image;media=(min-resolution:1.5dppx) and (max-resolution:1.999999dppx),&lt;/static/images/project-logos/enwiki-2x.png&gt;;rel=preload;as=image;media=(min-resolution:2dppx)</pre></div>



<h2 class="remarkup-header">Measuring the impact</h2>

<p>To confirm the expected impact of logo preloading, we recorded a before and after video using synthetic testing with <a href="https://www.sitespeed.io/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Sitespeed.io</a>, on a simulated slow internet connection, for a large page (the Barack Obama article on English Wikipedia), where the problem was more dramatic. The left pane is the article loading without logo preloading, the right pane is with logo preloading enabled. Focus your attention on the top-left of the article, where the Wikipedia logo is expected to appear:</p>

<p><div class="embedded-commons-video"><iframe width="650" height="400" style="margin: 1em auto; border: 0px;" src="https://commons.wikimedia.org/wiki/File:Barack_Obama_article_on_enwiki_before_and_after_logo_loading_priority_change.webm?embedplayer=yes" frameborder="0"></iframe></div></p>

<p>Unfortunately, current JavaScript APIs in the browser aren’t advanced enough to let us measure something as fine-grained as time-to-logo directly from users, which means that we can only speculate about the extent of its impact in the real world. The web performance field is making progress towards measuring more user-centric metrics, such as <a href="https://docs.google.com/document/d/1BR94tJdZLsin5poeet0XoTW60M0SjvOJQttKT-JK8HI/view" class="remarkup-link remarkup-link-ext" rel="noreferrer">First Meaningful Paint</a>, but we’re still very far from having the ability to collect such metrics directly from users.</p>

<p>In our case, the difference seen in synthetic testing is dramatic enough that we have a high level of confidence that it has made the user experience better in the real world for many people.</p>

<p>The preload link <a href="http://caniuse.com/#feat=link-rel-preload" class="remarkup-link remarkup-link-ext" rel="noreferrer">isn’t supported by all major web browsers yet</a>. When more browsers support it, MediaWiki will automatically benefit from it. We hope that wikis as large as Wikipedia relying on this very useful browser feature will be an incentive for more browsers to support it.</p></div></content></entry><entry><title>Investigating a performance improvement</title><link href="/phame/live/7/post/15/investigating_a_performance_improvement/" /><id>https://phabricator.wikimedia.org/phame/post/view/15/</id><author><name>Gilles (Gilles Dubuc)</name></author><published>2017-06-02T10:02:37+00:00</published><updated>2017-06-05T08:57:58+00:00</updated><content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Last week <a href="https://phabricator.wikimedia.org/p/Jdlrobson/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_101"><span class="phui-tag-core phui-tag-color-person">@Jdlrobson</span></a> pinged me by email about a performance improvement his team noticed for large wiki articles on the mobile site in our synthetic tests run on <a href="https://wikitech.wikimedia.org/wiki/WebPageTest" class="remarkup-link remarkup-link-ext" rel="noreferrer">WebPageTest</a>. The improvement looked like this, a sudden drop in <a href="https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index" class="remarkup-link remarkup-link-ext" rel="noreferrer">SpeedIndex</a> (where lower is better):</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/25msxmscf7gy3exjf7ef/PHID-FILE-ivbh5b67gslggechbyip/Capture_d_e%CC%81cran_2017-05-26_11.37.35.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_93"><img src="https://phab.wmfusercontent.org/file/data/25msxmscf7gy3exjf7ef/PHID-FILE-ivbh5b67gslggechbyip/Capture_d_e%CC%81cran_2017-05-26_11.37.35.png" height="463" width="1132" loading="lazy" alt="Capture d&#039;écran 2017-05-26 11.37.35.png (463×1 px, 82 KB)" /></a></div></p>

<p><a href="https://sites.google.com/a/webpagetest.org/docs/using-webpagetest/metrics/speed-index" class="remarkup-link remarkup-link-ext" rel="noreferrer">SpeedIndex</a> is described this way by <a href="https://twitter.com/patmeenan" class="remarkup-link remarkup-link-ext" rel="noreferrer">Pat Meenan</a>, the author of <a href="https://wikitech.wikimedia.org/wiki/WebPageTest" class="remarkup-link remarkup-link-ext" rel="noreferrer">WebPageTest</a>:</p>

<blockquote><p>The Speed Index is the average time at which visible parts of the page are displayed.  It is expressed in milliseconds and dependent on size of the view port.</p></blockquote>

<p>The actual mathematical definition is a bit more complicated, but essentially it captures, from an end-user perspective, a score of how fast visual completeness is reached above the fold, for a given viewport size and a given internet connection speed. In the case of the WebPageTest run spotted by <a href="https://phabricator.wikimedia.org/p/Jdlrobson/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_102"><span class="phui-tag-core phui-tag-color-person">@Jdlrobson</span></a> the viewport is mobile phone-sized and the speed is simulated 3G. We run different profiles in WebPageTest that represent different kinds of devices and internet speeds, because sometimes performance changes only affect certain devices, certain connection speeds, or even only certain kinds of wiki pages.</p>
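
<p>For reference, the documented definition amounts to integrating how visually incomplete the above-the-fold viewport still is over time, roughly along these lines (lower is better because the page fills in sooner):</p>

<div class="remarkup-code-block" data-code-lang="text" data-sigil="remarkup-code-block"><pre class="remarkup-code">SpeedIndex = integral from t = 0 to t = end of (1 - VisuallyComplete(t) / 100) dt

  VisuallyComplete(t)  percentage of the final above-the-fold rendering
                       that is already on screen at time t
  end                  the time at which the viewport is visually complete</pre></div>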

<p>As we do for every suspected performance improvement or regression from an unknown cause, we filed a task for it, tagged it with the <a href="/tag/performance-team/" class="phui-tag-view phui-tag-type-shade phui-tag-disabled phui-tag-shade phui-tag-icon-view " data-sigil="hovercard" data-meta="0_104"><span class="phui-tag-core "><span class="visual-only phui-icon-view phui-font-fa fa-users" data-meta="0_103" aria-hidden="true"></span>Performance-Team</span></a> tag, and began investigating: <a href="/T166373" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_98"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T166373: Investigate apparent performance improvement around 2017-05-24</span></span></a>. Creating a task directly and tagging us is also a good way to get our attention.</p>

<h2 class="remarkup-header">Comparing synthetic testing and real user metrics</h2>

<p>When a change like this happens in synthetic testing, we first verify whether or not a similar change was seen in our real user metrics. Specifically, <a href="https://grafana.wikimedia.org/dashboard/db/navigation-timing?refresh=5m&amp;orgId=1" class="remarkup-link remarkup-link-ext" rel="noreferrer">Navigation Timing in Grafana</a>.</p>

<p>SpeedIndex can&#039;t be measured on real users. Real user metrics are limited by the APIs available in the browser, which are very basic compared to what WebPageTest can do. There&#039;s no way to tell the visual completeness of the whole page from client-side code.</p>

<p>The main real user metric we track is firstPaint. However, firstPaint measures something very different from SpeedIndex: it is the moment the web browser starts painting <em>anything</em> on the page, whereas SpeedIndex is about how quickly visual completion of the viewport happens. Essentially, SpeedIndex describes the phase that happens after firstPaint, which real user metrics can&#039;t measure. But since they&#039;re on the same timeline, it&#039;s common for a SpeedIndex change to come with a variation in real user metrics like firstPaint. When that happens, it makes the investigation easier, because we know it&#039;s not an issue in our telemetry but a real effect. When there&#039;s no correlation between synthetic testing metrics and real user metrics, we just have to keep investigating in more depth.</p>
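
<p>As a rough sketch of the asymmetry (assuming a browser that exposes the Paint Timing API, which not all browsers do), a page can report its own paint timings, but there is no comparable client-side hook for viewport visual completeness:</p>

<div class="remarkup-code-block" data-code-lang="html" data-sigil="remarkup-code-block"><pre class="remarkup-code">&lt;!-- Illustrative only: reading first paint timings from the page itself,
     where the browser supports the Paint Timing API. Nothing equivalent
     exists for measuring visual completeness, hence SpeedIndex being
     restricted to synthetic testing. --&gt;
&lt;script&gt;
  if (window.performance &amp;&amp; performance.getEntriesByType) {
    performance.getEntriesByType(&#039;paint&#039;).forEach(function (entry) {
      // entry.name is &#039;first-paint&#039; or &#039;first-contentful-paint&#039;
      console.log(entry.name + &#039;: &#039; + entry.startTime.toFixed(0) + &#039; ms&#039;);
    });
  }
&lt;/script&gt;</pre></div>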

<p>This fundamental difference means that some performance improvements can improve SpeedIndex without changing firstPaint or any other Navigation Timing metric. It&#039;s unfortunate in the sense that we know performance has improved; we just can&#039;t measure how much it did in the real world for our users. This is exactly what we were seeing here: real user metrics didn&#039;t improve during that period. That doesn&#039;t mean performance didn&#039;t really improve for people; as we&#039;ll see later, it did. It&#039;s also fundamental to understand that Navigation Timing is only a partial view of performance, and that some performance changes simply cannot be measured from real user data at this time.</p>

<h2 class="remarkup-header">Comparing WebPageTest runs</h2>

<p>The next logical step was to compare WebPageTest runs before and after the performance change. Our synthetic tests, which run continuously, can be consulted on <a href="http://wpt.wmftest.org/" class="remarkup-link remarkup-link-ext" rel="noreferrer">our public WebPageTest instance</a>. WebPageTest&#039;s UI isn&#039;t the best suited for our use case, so here&#039;s a walkthrough of where to look. First you want to click on the <strong>test history</strong> section, which brings you to this view:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/qf4fjawhex4ssrlivua7/PHID-FILE-q6zq6olrl24jkk7p6ic2/Capture_d_e%CC%81cran_2017-05-30_13.47.38.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_94"><img src="https://phab.wmfusercontent.org/file/data/qf4fjawhex4ssrlivua7/PHID-FILE-q6zq6olrl24jkk7p6ic2/Capture_d_e%CC%81cran_2017-05-30_13.47.38.png" height="265" width="1002" loading="lazy" alt="Capture d&#039;écran 2017-05-30 13.47.38.png (265×1 px, 55 KB)" /></a></div></p>

<p>Then click on the <strong>show tests from all users</strong> checkbox. You should now see all our test runs:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/3ql634biebe3rkurc73y/PHID-FILE-a3yg6eucbu7cx23ex3z5/Capture_d_e%CC%81cran_2017-05-30_13.49.07.png" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_95"><img src="https://phab.wmfusercontent.org/file/data/3ql634biebe3rkurc73y/PHID-FILE-a3yg6eucbu7cx23ex3z5/Capture_d_e%CC%81cran_2017-05-30_13.49.07.png" height="471" width="1002" loading="lazy" alt="Capture d&#039;écran 2017-05-30 13.49.07.png (471×1 px, 145 KB)" /></a></div></p>

<p>We test a number of pages for the desktop and mobile sites, using various simulated internet connection speeds, and so on. Finding the tests you&#039;re interested in in this history view requires some manual labour: you have to scan for the labels you&#039;re looking for yourself, as the search box only applies to the URL.</p>

<p>WebPageTest supports a great feature to compare different runs from the history view. We won&#039;t get into that here, though, as the difference is visible from the screenshots of the runs alone. After combing through the history view, I found two runs of the same test (the Sweden article on English Wikipedia, browsing the mobile site on Chrome with a simulated 3G connection), one before and one after the SpeedIndex drop.</p>

<p><a href="http://wpt.wmftest.org/result/170524_KJ_9Z/" class="remarkup-link remarkup-link-ext" rel="noreferrer">Before</a>:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/httblcplhajjtarkhduq/PHID-FILE-mnp6ibsfr5o3suemcz5q/1_screen_thumb_%284%29.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_96"><img src="https://phab.wmfusercontent.org/file/data/httblcplhajjtarkhduq/PHID-FILE-mnp6ibsfr5o3suemcz5q/1_screen_thumb_%284%29.jpg" height="354" width="250" loading="lazy" alt="1_screen_thumb (4).jpg (354×250 px, 10 KB)" /></a></div></p>

<p><a href="http://wpt.wmftest.org/result/170524_00_SW/" class="remarkup-link remarkup-link-ext" rel="noreferrer">After</a>:</p>

<p><div class="phabricator-remarkup-embed-layout-left"><a href="https://phab.wmfusercontent.org/file/data/haqnkjqmhwrod4o5fkxo/PHID-FILE-4odgcokiwawj4nfd3i2o/1_screen_thumb_%285%29.jpg" class="phabricator-remarkup-embed-image-full" data-sigil="lightboxable" data-meta="0_97"><img src="https://phab.wmfusercontent.org/file/data/haqnkjqmhwrod4o5fkxo/PHID-FILE-4odgcokiwawj4nfd3i2o/1_screen_thumb_%285%29.jpg" height="354" width="250" loading="lazy" alt="1_screen_thumb (5).jpg (354×250 px, 17 KB)" /></a></div></p>

<p>It&#039;s obvious that the content above the fold changed. The new version displays mostly text above the fold, where the old version had images. This explains the SpeedIndex improvement: it&#039;s faster to load text than an image, which means that users get content they can consume above the fold faster. This is more dramatic on slow connections, which is why this performance improvement showed up on our synthetic testing that simulated a 3G connection.</p>

<h2 class="remarkup-header">Deliberate or accidental change?</h2>

<p>The next part of the investigation was to determine whether this was an accidental change or a deliberate one. For this, the first place to look is the Wikimedia <a href="https://wikitech.wikimedia.org/wiki/Server_Admin_Log" class="remarkup-link remarkup-link-ext" rel="noreferrer">Server Admin Log</a>. Whenever changes are deployed to Wikimedia production, log entries are added there. Deployments can be individual patches or our weekly deployment train. This part of the investigation is straightforward: we go through the log, looking for anything deployed around the time the performance change happened.</p>

<p>And sure enough, we found this log entry around the time of the performance change:</p>

<blockquote><p>18:31 thcipriani@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: mobileFrontend: Move first paragraph before infobox <a href="https://phabricator.wikimedia.org/T150325" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_99"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T150325</span></span></a> (duration: 00m 41s)</p></blockquote>

<p>The task quoted in that log entry, <a href="/T150325" class="phui-tag-view phui-tag-type-object " data-sigil="hovercard" data-meta="0_100"><span class="phui-tag-core-closed"><span class="phui-tag-core phui-tag-color-object">T150325: Move first paragraph before infobox on stable</span></span></a>, is a deliberate change to improve the user experience by showing the first section of an article at the top rather than the <a href="https://en.wikipedia.org/wiki/Help:Infobox" class="remarkup-link remarkup-link-ext" rel="noreferrer">infobox</a>. While making this change, <a href="https://phabricator.wikimedia.org/p/phuedx/" class="phui-tag-view phui-tag-type-person " data-sigil="hovercard" data-meta="0_105"><span class="phui-tag-core phui-tag-color-person">@phuedx</span></a> also improved the performance for users on slow internet connections, who will now see the first section of an article above the fold, which they can start reading early, instead of a mostly empty infobox whose images are still loading.</p></div></content></entry></feed>