
User-perceived page load performance study
Closed, ResolvedPublic

Description

Research page: https://meta.wikimedia.org/wiki/Research:Study_of_performance_perception

Introduction

The current metrics we use to measure page load performance are based on assumptions about what users prefer. Even the assumption that faster is always better is treated as universal in this field, while academic research suggests that speed might not be the main criterion users apply when judging quality of experience (performance stability might be preferred), and that the criterion likely depends on the person and the context.

We usually deal with two classes of metrics. Real user metrics (RUM) are collected passively from users, leveraging the performance information we can capture client-side; they are usually very low level, highly granular and quite disconnected from the experience as the user perceives it. The other class is synthetic metrics, where automated tools simulate the user experience and take measurements. These get closer to the user experience by letting us measure visual characteristics of a page load. But both are far from capturing what the page load feels like to users, because no human input is involved when the measurement is made. Even their modeling is often just a best guess by engineers, and only recently have studies looked at the correlation between those metrics and user sentiment; it wasn't part of the metrics' design.

In this study, we would like to bridge the gap between user sentiment and the passive RUM performance metrics that are easy to collect unobtrusively.

Collecting user-perceived page load performance

Many studies put users' noses into the mechanics of the page load and then ask them about it, for example by showing them two videos side by side of the same page loading differently. This is a very artificial exercise, disconnected from the real-world experience of loading a page in the middle of a user flow. In this study we want to avoid interfering with the page load experience, which is why we plan to ask real users, on the production wikis, about their experience after a real page load has finished, in the middle of their browsing session. After a randomly selected wiki page load, the user viewing it is asked via an in-page popup to score how fast or pleasant that page load was.

Users will have the option to opt out of this surveying permanently (the preference is stored in localStorage). It might be interesting to offer different ways to dismiss it (e.g. "I don't want to participate", "I don't understand the question") in order to tweak the UI if necessary.

Collecting RUM metrics alongside

This is very straightforward, as we already collect such metrics. These need to be bundled with the survey response, in order to later look for correlations. In addition to performance metrics, we should bundle anonymized information about things that could be relevant to performance (user agent, device, connection type, location, page type, etc.). Most of this information is already being collected by the NavigationTiming extension and we could simply build the study on top of that.
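For illustration, a single bundled record could look something like the following sketch. The field names are illustrative only and do not match the real NavigationTiming/QuickSurveys schemas.

# Hypothetical shape of one bundled record: the survey answer plus RUM
# metrics and anonymized context. Field names are illustrative only.
record = {
    "survey_response": -1,        # -1 = negative, +1 = positive
    "loadEventEnd": 2140,         # ms, from the Navigation Timing API
    "firstPaint": 830,            # ms
    "transferSize": 187000,       # bytes
    "userAgentFamily": "Firefox",
    "connectionType": "4g",
    "country": "FR",
    "pageType": "article",
}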

Attempting to build a model that lets us derive user-perceived performance scores from RUM data only

Once we have user-perceived performance scores and RUM data attached to it, we will attempt to build a model that reproduces user-perception scores based on the underlying RUM metrics.

We can try building a universal model at first, applying to all users and all pages on the wiki. And then attempt to build context-specific models (by wiki, page type, connection type, user agent, location, etc.) to see if we could get better correlation.

Ideally, given the large amount of RUM data we can collect (we could actually collect more than we currently do), we would be trying the most exhaustive set of features possible. We should try both expert models and machine learning, as prior work has shown that they can both give satisfying results in similar contexts.
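As a minimal sketch of the machine-learning route, assuming the joined survey + RUM data is exported to a flat file (the file name, column names and label encoding below are hypothetical):

# Minimal sketch: predict the survey answer from RUM metrics with a random
# forest. The CSV export, feature columns and label column are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("survey_with_rum.csv")
features = ["loadEventEnd", "domComplete", "firstPaint", "responseStart", "transferSize"]
X, y = df[features], df["survey_response"]  # +1 = positive, -1 = negative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))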

Scope

While it would be nice to have the user-perceived performance scores collected on all wikis, some have communities that are less likely to welcome such experiments by the WMF. We could focus the initial study on wikis that are usually more friendly to cutting-edge features, such as frwiki or cawiki. Doing at least 2 wikis would be good, in order to see if the same model could work for different wikis, or if we're already finding significant differences between wikis.

This study will focus only on the desktop website. It can easily be extended to the mobile site or even the mobile apps later, but for the sake of validating the idea, focusing on a single platform should be enough. There is no point making the study multi-platform if we don't get the results we hope for on a single one.

Challenges

  • Picking when to ask. How soon in the page load lifecycle is too soon to ask? (the user might not consider the page load to be finished yet) How late is too late? (the user might have forgotten how the page load felt)
  • Does the survey pop-up interfere with the page load perception itself? We have to display a piece of UI on screen to ask the question, and it's part of the page. We need to limit the effect of this measurement as much as possible; ensuring that the survey UI's appearance doesn't feel like part of the initial page load should be one of the main design criteria.
  • How should the question be asked? Phrasing matters. If the question is too broad (e.g. "are you having a pleasant experience?"), people might answer with a broader context in mind, such as their entire browsing session, the contents of the page, or whether they found what they wanted to find on the wiki. If the question is too narrow, it might make them think too much about page loading mechanics they normally don't think about.
  • What grading system should we use? There are a number of psychological effects at play when picking a score for something, and we should be careful to pick the model that's the most appropriate for this task.

Limitations

This study won't look at performance stability. For example, if the page loads preceding the one being surveyed were unusually fast or slow, that will likely affect the perception of the current one. If this initial study, which looks at page loads in isolation, identifies meaningful RUM metrics, we could explore performance stability more easily in a follow-up study.

Expected outcomes

  • We don't find any satisfying correlation between user-perceived performance and any RUM-based model, even sliced by page type/wiki/user profile. This informs us, and the greater performance community, that RUM metrics are a poor measurement of user-perceived performance. It would be a driving factor to implement new browser APIs that measure performance metrics closer to what users really experience. In the short term it would put a bigger emphasis on synthetic metrics as a better reference for user-perceived performance (academic work has already shown a decent correlation there), and it could drive work into improving synthetic metrics further. From an operational perspective, if we keep the surveys running indefinitely, we would still get to measure user-perceived performance globally as a metric we can follow directly. It would be harder to make actionable, but we would know globally whether user sentiment is getting better or worse over time, and slice it by different criteria.
  • We find a satisfying RUM-based universal model. Depending on its characteristics, we can assess whether it's wiki-specific or whether we have potentially uncovered a universal relationship, one that could be verified in follow-up studies done by others on other websites.
  • We find a satisfying RUM-based model adapted to some context. This would change the way performance optimization is done, by showing that context matters, meaning that improving performance might not take the form of a one-size-fits-all solution.

In the last 2 cases, this would allow us to have a universal performance metric that we can easily measure passively at scale and that we know is a good representation of user perception. This would be a small revolution in the performance field, where currently the user experience and the passive measurements are completely disconnected.


See also

The following URL shows the survey unconditionally (note: submissions are real!)

https://ca.wikipedia.org/wiki/Plantes?quicksurvey=internal-survey-perceived-performance-survey

The following dashboard shows ingestion of responses (Note: This can include other surveys in the future, although as of writing, no other ones are enabled).

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=QuickSurveysResponses

Details

Repo | Branch | Lines +/-
performance/navtiming | master | +20 -0
analytics/refinery | master | +5 -0
analytics/refinery | master | +35 -0
mediawiki/extensions/NavigationTiming | master | +6 -1
mediawiki/extensions/NavigationTiming | wmf/1.33.0-wmf.19 | +6 -1
operations/mediawiki-config | master | +15 -3
operations/mediawiki-config | master | +3 -0
mediawiki/extensions/NavigationTiming | master | +20 -3
operations/mediawiki-config | master | +0 -1
operations/mediawiki-config | master | +2 -12
operations/mediawiki-config | master | +13 -1
operations/mediawiki-config | master | +1 -1
analytics/refinery | master | +41 -0
operations/puppet | production | +1 -0
operations/mediawiki-config | master | +4 -1
operations/mediawiki-config | master | +4 -2
operations/mediawiki-config | master | +14 -0
mediawiki/extensions/NavigationTiming | master | +58 -8
operations/mediawiki-config | master | +1 -1
operations/mediawiki-config | master | +23 -1
mediawiki/extensions/NavigationTiming | master | +2 -1
mediawiki/extensions/WikimediaMessages | master | +12 -4
mediawiki/vagrant | master | +26 -1
mediawiki/extensions/QuickSurveys | master | +13 -13
mediawiki/core | master | +24 -2

Related Objects

Status | Assigned
Resolved | Gilles
Declined | Gilles
Resolved | Gilles
Resolved | Whatamidoing-WMF
Resolved | Gilles
Declined | Gilles
Resolved | Gilles
Resolved | Gilles
Resolved | Gilles
Resolved | Gilles
Resolved | Gilles
Declined | Gilles
Invalid | Gilles
Resolved | Gilles
Resolved | Gilles
Declined | None
Resolved | Gilles
Resolved | Gilles
Resolved | Whatamidoing-WMF
Resolved | Slaporte
Declined | Gilles
Declined | Gilles
Declined | Gilles
Resolved | Gilles

Event Timeline

There are a very large number of changes, so older changes are hidden.

Regarding July 5th, I looked into it and I don't see a survey responses spike in Hive:

SELECT COUNT(DISTINCT(event.surveyinstancetoken)), COUNT(*)
FROM event.quicksurveysresponses
WHERE year = 2018 AND month = 7 AND day > 1 AND day < 8
GROUP BY day;

Distinct tokens | Total responses
421 | 428
409 | 418
376 | 384
390 | 403
388 | 399
433 | 446

It's possible that what's seen in Grafana is a bug/overcounting in Graphite, which is a different datastore used to back these graphs. Hive is the canonical data store.

I forgot one very important detail about the survey impression: it's only recorded once the user has the survey in their viewport, which explains why we have some impressions that take minutes. It's simply that people initially scrolled past the survey before it appeared, then scrolled back to the top of the article after doing some reading.

We actually *don't* measure the amount of time it takes to download the survey assets and insert the survey into the page, which in the case of impressions that take minutes, happened way before quicksurveyimpression's performancenow.

Now, the figures I was seeing make a lot more sense. For page loads that take <1s, it's extremely unlikely that the survey isn't already inserted if users see it after 5s. It just means that they scrolled down and went back to the top of the page.

Let's break this down into buckets to see if there is a progressive drop-off in satisfaction rates. For page loads taking less than 1s, with s being the survey impression time in seconds:

Survey impression time | Positive responses
s < 1          | 95.25%
1 <= s < 3     | 92.56%
3 <= s < 6     | 92.59%
6 <= s < 10    | 94.44%
10 <= s < 20   | 91.68%
20 <= s < 30   | 89.65%
30 <= s < 40   | 87.76%
40 <= s < 120  | 94.5%

Query used:

SELECT q2.event.surveyResponseValue, COUNT(*) AS count
FROM event.quicksurveyinitiation AS q
INNER JOIN event.navigationtiming n ON q.event.surveyInstanceToken = n.event.stickyRandomSessionId
INNER JOIN event.quicksurveysresponses q2 ON q.event.surveyInstanceToken = q2.event.surveyInstanceToken
WHERE n.year = 2018 AND q.year = 2018 AND q2.year = 2018
AND q.event.surveyCodeName = "perceived-performance-survey"
AND q.event.performanceNow IS NOT NULL
AND n.event.loadEventEnd IS NOT NULL
AND q2.event.surveyResponseValue IS NOT NULL
AND n.event.loadEventEnd < 1000
AND q.event.performanceNow - n.event.loadEventEnd >= 30000
AND q.event.performanceNow - n.event.loadEventEnd < 40000
GROUP BY q2.event.surveyResponseValue;

I think this shows that as time goes by after the page load, people's memory of it becomes less reliable. This probably calls for filtering out late survey responses from the study. Anything beyond 10 seconds is probably too unreliable to be considered.
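In pandas terms, that filtering could be a sketch along these lines (the flat export is hypothetical; the columns mirror quicksurveyinitiation's performanceNow and navigationtiming's loadEventEnd):

# Sketch: drop responses where the survey appeared more than 10s after the
# page finished loading. The TSV export and column names are assumptions.
import pandas as pd

df = pd.read_csv("joined_survey_navtiming.tsv", sep="\t")
delay_ms = df["performanceNow"] - df["loadEventEnd"]
df = df[delay_ms < 10000]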

I added new features to my random forest model, namely the top image resource timing and the survey appearance time (full list of features in the view I created). I also filtered out all survey responses where the survey appeared more than 10 seconds after the pageload. Lo and behold:

             precision    recall  f1-score   support
-1                0.75      0.81      0.78      1312
 1                0.79      0.73      0.76      1288
avg / total       0.77      0.77      0.77      2600

Same thing, with the randomness pinned to a specific seed:

    precision    recall  f1-score   support

-1       0.75      0.81      0.78      1247
 1       0.81      0.75      0.78      1353

 avg / total       0.78      0.78      0.78      2600

And now, looking at feature importance for that model:

page_mediawikiloadend 0.05972091225258912
page_dominteractive 0.05555291638900518
page_domcomplete 0.05029047286517806
page_loadeventend 0.048274963237053846
page_loadeventstart 0.046181668409670085
ip 0.046109392740830896
survey_viewtime 0.04488665276254195
page_responsestart 0.04440687952739068
topimage_responseend 0.04077331115712293
recvfrom 0.029823405874552695
browsermajor 0.029482259852426335
topimage_responsestart 0.028736209656332376
topimage_starttime 0.028290985156393226
page_transfersize 0.028204055030698078
page_firstpaint 0.028037362514820707
page_requeststart 0.027769567173480763
topimage_fetchstart 0.02661139926653559
page_connectend 0.025833714670749037
page_rumspeedindex 0.02462637982425334
country 0.02395627074244878
devicefamily 0.02183131546802089
page_secureconnectionstart 0.02088374156242488
page_connectstart 0.019940496531770834
osmajor 0.019530654385698672
topimage_requeststart 0.018620757009868052
topimage_connectstart 0.01849116651369536
topimage_domainlookupstart 0.015145333820839442
page_fetchstart 0.014936654879505382
topimage_connectend 0.013892746408479624
browserfamily 0.012989476946930726
osminor 0.012813699149677529
topimage_domainlookupend 0.012744622802096366
topimage_encodedbodysize 0.009540190724377883
browserminor 0.009194449645718927
topimage_transfersize 0.007657837205586284
osfamily 0.006915917271512656
topimage_decodedbodysize 0.006408494747900259
wiki 0.006123717189758629
effectiveconnectiontype 0.004893297204520509
topimage_secureconnectionstart 0.00369666570159363
user_editcountbucket 0.0027404407734646537
page_redirecting 0.002609271324344027
page_responseend 0.0008302736281410966
topimage_workerstart 0.0
topimage_redirectend 0.0

It's interesting to see the prevalence of some of the top image metrics, when those have only been collected since September 18th (and thus are missing from a large portion of the records used in this analysis!). I'll look into re-doing the same model using only data from September 18 onwards. It might also be interesting to model against only articles that have a top image.

I'll also remove the survey view time, which is not something we'll have on NavigationTiming data outside of the survey.

Re-running the same analysis only with data more recent than when ResourceTiming was deployed (2018-09-18T19:40:06Z), and removing survey display time as a feature:

             precision    recall  f1-score   support
-1                0.85      0.73      0.78       336
 1                0.71      0.83      0.76       264
avg / total       0.79      0.78      0.78       600

Not a big difference compared to before. The top features are the same in a slightly different order, the top image ones are not more prominent than before.

Finally, looking only at entries that have top image data reduces the dataset too much, yielding worse results. We'll have to wait until we've collected more data to re-run that one.

Very sadly, while trying to make the same code run on Python 3, I updated sklearn without keeping track of which old version I was using before, and now I can't get results as good as earlier this week. This is what I get now, over the whole dataset, minus surveys that take > 10s to be viewed:

             precision    recall  f1-score   support
-1                0.66      0.84      0.74      1292
 1                0.80      0.60      0.69      1386
avg / total       0.73      0.72      0.71      2678

And that's with the Python 2 version of my code.

For Python 3 I run into problems with fit_transform that force me to map some columns manually to numerical values, which in turn gives worse results than with Python 2. It might be interesting to compare how the values get transformed by the Python 2 library; differences in the default strategy there might explain how different the results have been.
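For reference, one way to map the categorical columns to numeric codes consistently would be something like this sketch (the column list is hypothetical, and this is not necessarily what randomforest2.py does):

# Sketch: encode categorical RUM/context columns as integer codes before
# feeding them to the random forest. Column names are assumptions.
import pandas as pd

df = pd.read_csv("survey_with_rum.csv")
categorical = ["browserfamily", "osfamily", "devicefamily", "country", "wiki"]
for col in categorical:
    df[col] = df[col].astype("category").cat.codes  # missing values become -1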

It could also have been a fluke with the snapshot of data used the other day and with new negative responses recorded the old model might have worked less effectively.

I noticed, though, that in my undersampling experiments I wasn't always using the same amount of positive responses relative to negative responses. Which made me wonder what would happen if I undersampled the positive responses even more? Essentially using class imbalance in our favor, because we care more about capturing the negative responses than we do about capturing the positive ones. And that seems to be what has the biggest impact. This is slightly overdoing it, but clearly shows the effect:

             precision    recall  f1-score   support
-1                0.78      0.92      0.84      1306
 1                0.78      0.50      0.61       703
avg / total       0.78      0.78      0.76      2009

And toning it down a little:

             precision    recall  f1-score   support
-1                0.67      0.89      0.76      1317
 1                0.77      0.47      0.58      1094
avg / total       0.71      0.70      0.68      2411
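A sketch of the undersampling described above (the 0.6 ratio and column names are arbitrary assumptions):

# Sketch: undersample the majority (positive) class more aggressively than
# the negative class, since capturing negative responses matters more here.
import pandas as pd

df = pd.read_csv("survey_with_rum.csv")
pos = df[df["survey_response"] == 1]
neg = df[df["survey_response"] == -1]

# Keep roughly 0.6 positive examples per negative example (ratio is arbitrary).
pos_sample = pos.sample(n=int(len(neg) * 0.6), random_state=42)
balanced = pd.concat([neg, pos_sample]).sample(frac=1, random_state=42)  # shuffle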

By adding the time to the features, in the form of unix timestamp, hour, minute, seconds and day of the week, as well as the ISP's ASN (which I don't think actually contributes much to the improvement), I'm back to excellent results:

             precision    recall  f1-score   support
-1                0.77      0.84      0.80      1323
 1                0.83      0.77      0.80      1393
avg / total       0.80      0.80      0.80      2716

This probably works without accounting for users' timezones because the target wikis (fr, ca, ru) all have traffic heavily centered around continental Europe. To do this even better, we would need to collect the client-side timezone in NavigationTiming and apply it to the timestamps recorded by EventLogging.
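Deriving those time features could look roughly like this (the timestamp column name is an assumption, and the client-side timezone adjustment mentioned above is left out):

# Sketch: derive time-of-day / day-of-week features from the EventLogging
# timestamp. The column name "dt" is an assumption.
import pandas as pd

df = pd.read_csv("survey_with_rum.csv")
ts = pd.to_datetime(df["dt"])

df["unix_ts"] = ts.astype("int64") // 10**9  # seconds since epoch
df["hour"] = ts.dt.hour
df["minute"] = ts.dt.minute
df["second"] = ts.dt.second
df["weekday"] = ts.dt.dayofweek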

And if we leave out the top image metrics, because they've only been recording since mid-September, we get these results:

             precision    recall  f1-score   support
-1                0.73      0.89      0.80      1323
 1                0.87      0.69      0.77      1393
avg / total       0.80      0.79      0.79      2716

Still pretty great!

Now, it's quite noteworthy that the most prominent feature, by far, is the unix timestamp:

unifiedperformancesurvey.ts 0.12601567188
unifiedperformancesurvey.page_loadeventstart 0.0552107967697
unifiedperformancesurvey.page_mediawikiloadend 0.0542528347965
unifiedperformancesurvey.page_dominteractive 0.0499585312226
unifiedperformancesurvey.page_domcomplete 0.0452909574817
unifiedperformancesurvey.page_loadeventend 0.0439757009884
unifiedperformancesurvey.ip 0.0428143759841
unifiedperformancesurvey.asn 0.0421671778307
unifiedperformancesurvey.minute 0.0407356065918

This might look surprising at first when you consider that the random forest's trees have branches with rules like "if this value is greater than some threshold, go to branch A, otherwise go to branch B". Given how the training and validation work, the fact that the model can predict things within the timespan of the recorded data doesn't mean that it will be able to infer things correctly from future timestamps. It's interesting nonetheless, because it might indicate that having the timestamp is necessary to account for seasonality, or for changes in the environment (e.g. a performance improvement/regression). If we had a year's worth of data, we could try adding month and day of the month as features.

To close on these great results, let's try one last time without the top image metrics and without the timestamp (but keeping hour, minute, seconds, weekday):

             precision    recall  f1-score   support
-1                0.72      0.85      0.78      1323
 1                0.83      0.68      0.75      1393
avg / total       0.77      0.76      0.76      2716

Not bad at all :)

The prominence of timestamp makes a ton of sense to me. Depending on the time of day, users are either almost certainly hitting ESAMS (which will have a hot cache for all of the test wikis), or hitting a different data center (in which case they're much less likely to have a hot cache for frwiki and the like). I wouldn't be surprised if hour was the most meaningful factor out of (month, day, hour, minute).

Without the timestamp, my latest model actually ranks those time features in the reverse order of importance, but they're in the same range; some of the ordering just comes from the randomness of the random forest algorithm. I wouldn't read too much into the order of feature importances when the values are in the same range:

unifiedperformancesurvey.seconds 0.0490926217513
unifiedperformancesurvey.minute 0.0467801232989
unifiedperformancesurvey.hour 0.0394378997918

Timestamp really stands out when it's in the mix, though, in all the models I've been running since the beginning it's the highest importance value I've seen, by far. It's almost twice the value of the second best.

I think that to verify that the model incorporating the timestamp really works, I need to set apart a whole chunk of separate time as part of the validation set. Right now I believe that the splitting algorithm is picking training values and validation values randomly in the same (whole) timespan. This intertwining of training and validation values over time is probably what makes timestamp so effective as a feature, in my opinion. Essentially, if you know how people feel around a specific time, it's easier to predict the performance perception of others around the same time. With Russian wikipedia dominating the data because of its larger traffic, it's easy to imagine that collectively internet users in Russia might be experiencing slowdowns at the same time. It could be coming from us, from the network, from traffic peaks at specific times of day. It could also be traffic spikes to specific articles.
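A sketch of holding out a separate chunk of time as the validation set (sklearn's TimeSeriesSplit offers a cross-validated variant of the same idea; the column name is an assumption):

# Sketch: hold out the most recent ~20% of the timespan as the validation
# set, instead of splitting randomly across the whole period, so that the
# timestamp feature cannot "leak" labels from neighbouring samples.
import pandas as pd

df = pd.read_csv("survey_with_rum.csv")
df = df.sort_values("dt")           # "dt" = event timestamp (assumed name)

split = int(len(df) * 0.8)
train, valid = df.iloc[:split], df.iloc[split:]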

I'm going to lock down the task somewhat for now at Dario Rossi's request, because having preliminary results available publicly is problematic for the double-blind submission of the research paper.

Gilles changed the visibility from "Public (No Login Required)" to "Custom Policy".

Another thing that occurred to me: we know that we've had factors (eg, Chrome 69) in the study period that may have changed actual perception for a large group of subjects. I wonder if that could also be influencing the importance of timestamp as a factor?

Absolutely, and the DC switchover, students going back to school in September, etc.

I've added month, month day and year day, while still keeping the actual timestamp out of the features, and it's "year day" that's super prominent, almost as much as the timestamp was. I think this backs up the idea that the model is capturing long-term trends that critically affect user perception. It actually breaks a new record when these features are used instead of the raw timestamp:

             precision    recall  f1-score   support
-1                0.80      0.88      0.84      1323
 1                0.87      0.79      0.83      1393
avg / total       0.84      0.83      0.83      2716

Now the question is to what extent it's capturing cyclical (back to school) versus non-cyclical (DC switchover, browser update) events. Maybe we can tell by inspecting the generated trees and looking at which days the branches use as cutoff points; I'll look at that next week.

If we want the model to keep capturing non-cyclical events, it will have to be streaming or run regularly against recent data. In other words, to be able to make it adapt to non-cyclical events, we have to keep the survey running permanently for a small fraction of users.
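Inspecting those cutoff points could be done roughly like this (a sketch assuming the fitted RandomForestClassifier model and features list from the earlier sketches, with "year_day" as a hypothetical feature name):

# Sketch: list the split thresholds the forest's trees use for the
# "year_day" feature, to see which calendar days act as cutoff points.
# Assumes a fitted RandomForestClassifier `model` and a `features` list.
import numpy as np

idx = features.index("year_day")
thresholds = []
for tree in model.estimators_:
    t = tree.tree_
    thresholds.extend(t.threshold[t.feature == idx])  # nodes splitting on year_day

print(np.sort(np.unique(np.round(thresholds))))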

Hi Gilles, nice results!
Unluckily I am not able to replicate them and I am still stuck at around 0.6 recall for negative replies. Can you please post the query you are using to collect the data, and the qsi.event.performanceNow information you use to filter out the late surveys?

I have been trying this one and it succeeded for September, but for the month of August I get:

Error: org.apache.thrift.TException: Error in calling method CloseOperation (state=08S01,code=0)
Error: Error while cleaning up the server resources (state=,code=0)

This is the query I am using

SELECT * FROM event.quicksurveysresponses AS qsr
INNER JOIN event.quicksurveyinitiation qsi ON qsr.event.surveyInstanceToken = qsi.event.surveyInstanceToken
INNER JOIN event.navigationtiming nt ON qsr.event.surveyInstanceToken = nt.event.stickyRandomSessionId
WHERE qsr.year = 2018 AND qsi.year = 2018 AND nt.year = 2018 
AND qsr.month = 8 AND qsi.month = 8 AND nt.month = 8
AND qsr.event.surveyCodeName = "perceived-performance-survey"
AND qsr.event.surveyResponseValue IN ('ext-quicksurveys-example-internal-survey-answer-positive', 'ext-quicksurveys-example-internal-survey-answer-negative')
AND qsi.event.performanceNow - nt.event.loadEventEnd < 10000;

I may have managed to replicate your results (if the query I made to the DB was correct), but in my case these refer to only one fold, which appears to be a "lucky" one, since when I perform a 10-fold validation there is a significant drop in performance, especially for the recall of (-1), compared to these:

precision    recall  f1-score   support

          -1       0.75      0.88      0.81      1047
           1       0.85      0.70      0.77      1010

   micro avg       0.79      0.79      0.79      2057
   macro avg       0.80      0.79      0.79      2057
weighted avg       0.80      0.79      0.79      2057

The current dataset I'm using is coming from the view I've created. Querying this view takes 30+ minutes, though.

You can see the query it's doing with

SHOW CREATE TABLE unifiedPerformanceSurvey;

Your query looks fine, it would just be missing more recent data where the column in the navigationtiming table was renamed.

Anyway, instead of running the query I've already run, you can just access the data directly under /home/gilles/export.tsv on stat1004

You can also find the script I'm using to process the data (where, quite importantly, I'm dropping a bunch of features captured by that query) under /home/gilles/randomforest2.py also on stat1004. You need to run it with Python 2.

As you'll see at the end, I'm using GridSearchCV with 10-fold cross-validation. Unless I'm mistaken, it's selecting the best estimator based on that. All the classification reports I've shared came from that.
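For reference, that selection step could look roughly like the following sketch (the hyperparameter grid is arbitrary and the train/test variables come from the earlier sketch; this is not the actual randomforest2.py code):

# Sketch: pick the best random forest via GridSearchCV with 10-fold CV,
# then report on the held-out test set. Grid values are arbitrary.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=10, scoring="f1_macro", n_jobs=-1)
search.fit(X_train, y_train)          # X_train/y_train from the earlier sketch

print(search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))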

Looking only at November data, which contains new metrics like the cpu benchmark and top image resource timing, these are the top Pearson correlations for individual metrics:

page_responseend -0.124995
page_loadeventstart -0.118504
page_domcomplete -0.118501
page_loadeventend -0.118448
page_dominteractive -0.117836
topimage_starttime -0.100179
topimage_fetchstart -0.100179
page_tcp -0.095061
topimage_domainlookupend -0.093590
topimage_connectstart -0.093457
topimage_responseend -0.092828
topimage_domainlookupstart -0.092723
topimage_requeststart -0.091953
topimage_connectend -0.091017
topimage_responsestart -0.087445
page_processing -0.086445
page_rumspeedindex -0.082267
topimage_secureconnectionstart -0.069033

And for reference, firstPaint is close to 0 (and positively correlated, even):

page_firstpaint 0.005310

In the new ones it's interesting to see that central notice time has a low correlation:

centralnoticetime -0.020357

CPU score correlation isn't high:

cpuscore -0.044808

While top image latency seems important, the size of the image isn't:

topimage_transfersize -0.001743

It's too bad that we can't (yet) have such a thing as top image firstPaint, it might score high.
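Those correlations can be reproduced with something like this sketch (the export and column names are assumptions):

# Sketch: Pearson correlation of each RUM metric against the survey answer
# (+1/-1). Column naming follows the tables above but is still an assumption.
import pandas as pd

df = pd.read_csv("survey_with_rum_november.csv")
metrics = [c for c in df.columns if c.startswith(("page_", "topimage_"))]
print(df[metrics].corrwith(df["survey_response"]).sort_values())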

Now that we have CPU benchmarking data, we can verify what came out of the manual labelling of device power based on UA. And just like before, the slower the device, the more unhappy you are likely to be about your pageload's performance:

Subset of responses | Percentage of positive responses
All | 88.74
100 < cpu score <= 200 | 90.75
200 < cpu score <= 300 | 90.46
300 < cpu score <= 400 | 89.01
400 < cpu score <= 500 | 86.73
500 < cpu score | 85.81

Following the discussion we had during the private presentation on Monday, the next action items seem to be:

  • increase overall NavigationTiming sampling rate on ruwiki
  • run a community consultation on eswiki to have the survey run there

After checking the data on Turnilo, I see that eswiki has the same amount of traffic as ruwiki. If we apply the same new sampling rate (1 in every 100) to eswiki as we're going to for ruwiki, both measures combined would increase our data collection by 20x. Meaning we could potentially collect approximately 15000 survey responses per day, or 5.4 million per year.
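As a back-of-the-envelope check of that projection (figures taken from this comment):

# Rough arithmetic behind the projection above.
daily = 15000        # projected survey responses per day after both changes
print(daily * 365)   # 5475000, in line with the rough 5.4 million per year figure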

I've increased the overall navtiming rate for ruwiki and the survey rate for frwiki

Change 483369 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Revert ruwiki navtiming rate

https://gerrit.wikimedia.org/r/483369

Change 483369 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert ruwiki navtiming rate

https://gerrit.wikimedia.org/r/483369

Mentioned in SAL (#wikimedia-operations) [2019-01-10T09:52:48Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T187299 Decrease ruwiki navtiming rate (duration: 00m 52s)

Gilles added a subscriber: stjn.

@stjn I tried to respond to your message on ruwiki but it appears I've been blocked from editing there...

Never mind, I'm not blocked, my message was probably just triggering an abuse filter. I was just responding that I'm going to work on making a different sampling rate for editors and readers possible for the survey.

Great, sorry for not answering there sooner.

Change 485763 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Add ability to set different survey rate for logged-in users

https://gerrit.wikimedia.org/r/485763

I think we could collect fairly cheaply a history of past loadEventStart stored in localStorage and have that recorded when the user is sampled by navtiming. This way we would be able to see if the pageview respondents are asked about is unusually fast/slow compared to what they've previously experienced. Something like an array of timestamp + loadEventEnd recorded for each article view, and we keep the last 10/20.

Change 485763 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Add ability to set different survey rate for logged-in users

https://gerrit.wikimedia.org/r/485763

Change 491229 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Launch performance perception survey on eswiki

https://gerrit.wikimedia.org/r/491229

Change 491229 merged by jenkins-bot:
[operations/mediawiki-config@master] Launch performance perception survey on eswiki

https://gerrit.wikimedia.org/r/491229

Mentioned in SAL (#wikimedia-operations) [2019-02-19T11:26:13Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T187299 Launch performance perception survey on eswiki (duration: 00m 46s)

I've had a look at the eswiki data on https://grafana.wikimedia.org/d/000000551/performance-perception-survey and picking a couple of different days, the 87% satisfaction ratio is true there as well. I think it's quite remarkable that this holds true, it's the same ratio we had on ruwiki.

Change 493055 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Oversample navtiming on ruwiki and eswiki

https://gerrit.wikimedia.org/r/493055

Change 493055 merged by jenkins-bot:
[operations/mediawiki-config@master] Oversample navtiming on ruwiki and eswiki

https://gerrit.wikimedia.org/r/493055

Mentioned in SAL (#wikimedia-operations) [2019-03-05T10:07:37Z] <gilles@deploy1001> Synchronized php-1.33.0-wmf.19/extensions/NavigationTiming: T187299 Backport wiki oversampling config syntax change (duration: 00m 48s)

Mentioned in SAL (#wikimedia-operations) [2019-03-05T10:10:50Z] <gilles@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T187299 Oversample navtiming on ruwiki and eswiki (duration: 00m 47s)

Wiki oversampling results in a bunch of warnings: https://logstash.wikimedia.org/app/kibana#/dashboard/1c3a4d80-35c2-11e7-b186-d1bc9cbdde4c?_g=h@ba40421&_a=h@f9f2916

PHP Warning: array_filter() expects parameter 1 to be an array or collection
#0 [internal function]: MWExceptionHandler::handleError(integer, string, string, integer, array, array)
#1 /srv/mediawiki/php-1.33.0-wmf.19/extensions/NavigationTiming/NavigationTiming.config.php(41): array_filter(integer, Closure$NavigationTimingConfig::getNavigationTimingConfigVars;1616, integer)
#2 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoaderFileModule.php(1121): NavigationTimingConfig::getNavigationTimingConfigVars(ResourceLoaderContext)
#3 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoaderFileModule.php(623): ResourceLoaderFileModule->expandPackageFiles(ResourceLoaderContext)
#4 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoaderModule.php(827): ResourceLoaderFileModule->getDefinitionSummary(ResourceLoaderContext)
#5 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoader.php(662): ResourceLoaderModule->getVersionHash(ResourceLoaderContext)
#6 [internal function]: Closure$ResourceLoader::getCombinedVersion(string)
#7 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoader.php(674): array_map(Closure$ResourceLoader::getCombinedVersion;613, array)
#8 /srv/mediawiki/php-1.33.0-wmf.19/includes/resourceloader/ResourceLoader.php(755): ResourceLoader->getCombinedVersion(ResourceLoaderContext, array)
#9 /srv/mediawiki/php-1.33.0-wmf.19/load.php(46): ResourceLoader->respond(ResourceLoaderContext)
#10 /srv/mediawiki/w/load.php(3): include(string)
#11 {main}

This comes from the oversample array validation in NavigationTiming.config.php.

Change 494460 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Fix NavigationTimingOversampleFactor validation

https://gerrit.wikimedia.org/r/494460

Change 494463 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@wmf/1.33.0-wmf.19] Fix NavigationTimingOversampleFactor validation

https://gerrit.wikimedia.org/r/494463

Change 494463 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@wmf/1.33.0-wmf.19] Fix NavigationTimingOversampleFactor validation

https://gerrit.wikimedia.org/r/494463

Mentioned in SAL (#wikimedia-operations) [2019-03-05T10:55:07Z] <gilles@deploy1001> Synchronized php-1.33.0-wmf.19/extensions/NavigationTiming/NavigationTiming.config.php: T187299 Fix wiki oversampling config validation (duration: 00m 48s)

All good now. Warning went away and oversampling active, as confirmed by EventLogging-schema dashboard on Grafana:

[Screenshot attachment: Capture d'écran 2019-03-05 12.01.09.png (302 KB)]

It more than doubles the navtiming events, but that's still small compared to the overall EventLogging traffic (7-8 events/sec vs 1694).

QuickSurvey responses also increased, as expected:

[Screenshot attachment: Capture d'écran 2019-03-05 12.02.54.png (235 KB)]

Change 494460 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Fix NavigationTimingOversampleFactor validation

https://gerrit.wikimedia.org/r/494460

As a result of the oversampling, we are showing the survey to 10x as many anonymous users and the same number of logged-in users as before. We are only getting 3x the responses (and that's true for both eswiki and ruwiki). This shows that there are diminishing returns to displaying the survey more.

As a result, at the current rate we are collecting around 17k non-neutral survey responses per day across all surveyed wikis, which works out to 510k per month, 6+ million per year:

0: jdbc:hive2://an-coord1001.eqiad.wmnet:1000> SELECT COUNT(*) FROM event.quicksurveysresponses WHERE year = 2019 AND month = 3 AND day = 6 AND event.surveyResponseValue IN ('ext-quicksurveys-example-internal-survey-answer-positive', 'ext-quicksurveys-example-internal-survey-answer-negative');
17041

@Fsalutari do you think that will be a sufficient amount to attempt deep learning?

Change 494921 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Oversample navtiming on ruwiki and eswiki

https://gerrit.wikimedia.org/r/494921

Change 494921 abandoned by Gilles:
Oversample navtiming on ruwiki and eswiki

https://gerrit.wikimedia.org/r/494921

Change 494921 restored by Gilles:
Oversample navtiming on ruwiki and eswiki

https://gerrit.wikimedia.org/r/494921

Great!
Yes I think so. Let's try

Gilles lowered the priority of this task from Medium to Low. (Apr 30 2019, 12:03 PM)

Change 512205 had a related patch set uploaded (by Gilles; owner: Gilles):
[analytics/refinery@master] Retain more performance data

https://gerrit.wikimedia.org/r/512205

Change 512287 had a related patch set uploaded (by Gilles; owner: Gilles):
[performance/navtiming@master] Track performance perception survey impressions

https://gerrit.wikimedia.org/r/512287

Change 512205 merged by Gilles:
[analytics/refinery@master] Retain more performance data

https://gerrit.wikimedia.org/r/512205

Change 514018 had a related patch set uploaded (by Gilles; owner: Gilles):
[analytics/refinery@master] Retain RUMSpeedIndex

https://gerrit.wikimedia.org/r/514018

Change 514018 merged by Gilles:
[analytics/refinery@master] Retain RUMSpeedIndex

https://gerrit.wikimedia.org/r/514018

Change 512287 merged by jenkins-bot:
[performance/navtiming@master] Track performance perception survey impressions

https://gerrit.wikimedia.org/r/512287

Gilles lowered the priority of this task from Low to Lowest. (Oct 24 2019, 8:28 AM)
Krinkle changed the visibility from "Custom Policy" to "Public (No Login Required)". (Sep 8 2021, 3:15 PM)