User-perceived page load performance study
Open, NormalPublic

Description

Research page: https://meta.wikimedia.org/wiki/Research:Study_of_performance_perception

Introduction

The current metrics we use to measure page load performance are based on assumptions about what users prefer. For example, the assumption that faster is always better is universal in the metrics used in this field, even though academic research suggests that it may not be the main criterion by which users judge quality of experience (performance stability might be preferred instead), and that the criterion likely depends on the person and the context.

We usually deal with two classes of metrics. Real user metrics (RUM) are collected passively from users, leveraging the performance information that we can capture client-side. They are usually very low level, highly granular and quite disconnected from the experience as the user perceives it. The other class is synthetic metrics, where automated tools simulate a page load and measure it. These get closer to the user experience by allowing us to measure visual characteristics of the page load. But both classes fall short of capturing what the page load feels like to users, because no human input is involved when the measurement is made. Even their modeling is often just an engineer's best guess, and only recently have studies looked at the correlation between those metrics and user sentiment; it was not part of the metrics' design.

In this study, we would like to bridge the gap between user sentiment and the passive RUM performance metrics that are easy to collect unobtrusively.

Collecting user-perceived page load performance

A lot of studies push users' noses into the mechanics of the page load in order to ask them about it, for example by showing them two videos side by side of the same page loading differently. This is a very artificial exercise, disconnected from the real-world experience of loading a page in the middle of a user flow. In this study we want to avoid interfering with the page load experience, which is why we plan to ask real users, on the production wikis, about their experience after a real page load has finished, in the middle of their browsing session. After a randomly selected wiki page load, the user viewing it is asked via an in-page popup to score how fast or pleasant that page load was.

Users will have the option to opt out of this surveying permanently (the preference being stored in local storage). It might be interesting to give them different options to dismiss it (e.g. "I don't want to participate", "I don't understand the question") in order to tweak the UI if necessary.

Collecting RUM metrics alongside

This is very straightforward, as we already collect such metrics. These need to be bundled with the survey response, in order to later look for correlations. In addition to performance metrics, we should bundle anonymized information about things that could be relevant to performance (user agent, device, connection type, location, page type, etc.). Most of this information is already being collected by the NavigationTiming extension and we could simply build the study on top of that.

Attempting to build a model that lets us derive user-perceived performance scores from RUM data only

Once we have user-perceived performance scores with RUM data attached to them, we will attempt to build a model that reproduces the user-perception scores based on the underlying RUM metrics.

We can first try building a universal model that applies to all users and all pages on the wiki, and then attempt to build context-specific models (by wiki, page type, connection type, user agent, location, etc.) to see whether they yield better correlation.

Ideally, given the large amount of RUM data we can collect (we could actually collect more than we currently do), we would try the most exhaustive set of features possible. We should try both expert models and machine learning, as prior work has shown that both can give satisfying results in similar contexts.
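As a rough illustration of what such a model-building attempt could look like, here is a minimal sketch using scikit-learn logistic regression. It assumes a hypothetical export of the joined survey + RUM data; the file name and column names are illustrative placeholders, not the actual schema.

# Minimal sketch: predict the survey response from a few RUM metrics.
# Assumes a hypothetical CSV export of joined NavigationTiming + survey data;
# the file name and column names are placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("joined_rum_and_survey.csv")  # hypothetical export

features = ["firstPaint", "loadEventEnd", "responseStart", "domInteractive"]
X = df[features].fillna(df[features].median())
y = (df["surveyResponse"] == "positive").astype(int)  # 1 = "the page loaded fast enough"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

Context-specific models could then be fitted the same way on subsets of the data (per wiki, page type, connection type, etc.) and compared on the same held-out metric.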

Scope

While it would be nice to have the user-perceived performance scores collected on all wikis, some have communities that are less likely to welcome such experiments by the WMF. We could focus the initial study on wikis that are usually more friendly to cutting-edge features, such as frwiki or cawiki. Doing at least 2 wikis would be good, in order to see if the same model could work for different wikis, or if we're already finding significant differences between wikis.

This study will focus only on the desktop website. It can easily be extended to the mobile site or even the mobile apps later, but for the sake of validating the idea, focusing on a single platform should be enough. There is no point making the study multi-platform if we don't get the results we hope for on a single one.

Challenges

  • Picking when to ask. How soon in the page load lifecycle is too soon to ask? (the user might not consider the page load to be finished yet) How late is too late? (the user might have forgotten how the page load felt)
  • Does the survey pop-up interfere with the page load perception itself? We have to display a piece of UI on screen to ask the question, and it's part of the page, so we need to limit the effect of this measurement as much as possible. One of the main design criteria should be that the survey UI's appearance doesn't feel like part of the initial page load.
  • How should the question be asked? Phrasing matters. If we ask a question that is too broad (e.g. "are you having a pleasant experience?"), people might answer with a broader context in mind, like their entire browsing session, the contents of the page, or whether they found what they wanted to find on the wiki. If the question is too narrow, it might make them think too much about page loading mechanics they normally don't think about.
  • What grading system should we use? There are a number of psychological effects at play when picking a score for something, and we should be careful to pick the model that's the most appropriate for this task.

Limitations

This study won't look at performance stability. For example, if the page loads before the one being surveyed were unusually fast or slow, this will likely affect the perception of the current one. We could explore that topic more easily in a follow-up study if we identify meaningful RUM metrics in this initial study, which is limited to page loads considered in isolation.

Expected outcomes

  • We don't find any satisfying correlation between user-perceived performance and any RUM-based model, even when sliced by page type/wiki/user profile. This informs us, and the wider performance community, that RUM metrics are a poor measurement of user-perceived performance. It would be a driving factor to implement new browser APIs that measure performance in ways closer to what users really experience. In the short term it would put a bigger emphasis on synthetic metrics as a better reference for user-perceived performance (academic work has already shown a decent correlation there), and it could drive further work on improving synthetic metrics. Also, from an operational perspective, if we keep the surveys running indefinitely, we would still get to measure user-perceived performance globally as a metric we can follow directly. It would be harder to make actionable, but we would know globally whether user sentiment is getting better or worse over time and could slice it by different criteria.
  • We find a satisfying RUM-based universal model. Depending on its characteristics, we can assess whether it's specific to our wikis or whether we have potentially uncovered a universal relationship, which could be verified in follow-up studies done by others on other websites.
  • We find a satisfying RUM-based model adapted to some context. This would change the way performance optimization is done, by showing that context matters, meaning that improving performance might not take the form of a one-size-fits-all solution.

In the last 2 cases, this would allow us to have a universal performance metric that we can easily measure passively at scale and that we know is a good representation of user perception. This would be a small revolution in the performance field, where currently the user experience and the passive measurements are completely disconnected.


See also

The following url shows the survey unconditionally (Note: submissions are real!)

https://ca.wikipedia.org/wiki/Plantes?quicksurvey=internal-survey-perceived-performance-survey

The following dashboard shows ingestion of responses (Note: This can include other surveys in the future, although as of writing, no other ones are enabled).

https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?var-schema=QuickSurveysResponses

Related Objects


Change 421278 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/core@master] Implement mw.user.getPageviewId

https://gerrit.wikimedia.org/r/421278

Change 421280 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/QuickSurveys@master] Add support for performance survey

https://gerrit.wikimedia.org/r/421280

Change 421283 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Add support for performance survey

https://gerrit.wikimedia.org/r/421283

Jdlrobson added subscribers: ovasileva, Jdlrobson.

Ping @ovasileva as maintainers of QuickSurveys we should help review the QuickSurveys patch from the performance team.

Restricted Application added a project: Readers-Web-Backlog. Mar 23 2018, 5:31 PM

Change 421921 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Add performance perception QuickSurvey definition

https://gerrit.wikimedia.org/r/421921

Change 421278 merged by jenkins-bot:
[mediawiki/core@master] mediawiki.user: Implement mw.user.stickyRandomId

https://gerrit.wikimedia.org/r/421278

Change 422159 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/WikimediaMessages@master] Add performance perception survey wording

https://gerrit.wikimedia.org/r/422159

Change 421280 merged by jenkins-bot:
[mediawiki/extensions/QuickSurveys@master] Easier survey invocation

https://gerrit.wikimedia.org/r/421280

Ping me if you need anything from readers web going forward! Thanks!

@Gilles I want to try it out, what's the easiest way to do it?

Change 422409 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/vagrant@master] Add performance perception survey

https://gerrit.wikimedia.org/r/422409

@Gilles I want to try it out, what's the easiest way to do it?

Make sure that you have up-to-date MediaWiki core + QuickSurveys extension.

Apply this NavigationTiming patch: https://gerrit.wikimedia.org/r/#/c/421283/
Apply this WikimediaMessages patch: https://gerrit.wikimedia.org/r/#/c/422159/
Apply this Vagrant patch (or use similar configuration if you're testing this without Vagrant): https://gerrit.wikimedia.org/r/#/c/422409/

Enable QuickSurveys, NavigationTiming and WikimediaMessages extensions.

Visit any article other than the main page. The survey should appear in the top-right corner of the article.

Gilles moved this task from Next-up to Doing on the Performance-Team board.Apr 25 2018, 10:44 AM

Change 422409 merged by jenkins-bot:
[mediawiki/vagrant@master] Add performance perception survey

https://gerrit.wikimedia.org/r/422409

Change 421283 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Add support for performance survey

https://gerrit.wikimedia.org/r/421283

Change 422159 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMessages@master] Add performance perception survey wording

https://gerrit.wikimedia.org/r/422159

Change 429862 had a related patch set uploaded (by Gilles; owner: Gilles):
[mediawiki/extensions/NavigationTiming@master] Add missing dependency on mediawiki.user

https://gerrit.wikimedia.org/r/429862

Change 429862 merged by jenkins-bot:
[mediawiki/extensions/NavigationTiming@master] Add missing dependency on mediawiki.user

https://gerrit.wikimedia.org/r/429862

Change 421921 merged by jenkins-bot:
[operations/mediawiki-config@master] Add performance perception QuickSurvey definition

https://gerrit.wikimedia.org/r/421921

Gilles added a comment.May 2 2018, 1:25 PM
/wiki/Sp%C3%A9cial:Version   InvalidArgumentException from line 58 of /srv/mediawiki/php-1.32.0-wmf.1/extensions/QuickSurveys/includes/SurveyFactory.php: The "perceived-performance-survey" survey doesn't have a coverage.

Change 430363 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Fix stray space in quicksurvey configuration

https://gerrit.wikimedia.org/r/430363

Change 430363 merged by jenkins-bot:
[operations/mediawiki-config@master] Fix stray space in quicksurvey configuration

https://gerrit.wikimedia.org/r/430363

Mentioned in SAL (#wikimedia-operations) [2018-05-02T13:37:12Z] <gilles@tin> Synchronized wmf-config/InitialiseSettings.php: T187299 Add performance perception QuickSurvey definition (duration: 01m 17s)

Lofhi added a subscriber: Lofhi.May 11 2018, 10:17 PM
stjn added a subscriber: stjn.May 14 2018, 9:56 AM

I'm going to target roughly 100 survey impressions per day on cawiki, which according to Pivot sees a bit more than 1 million pageviews per day. That's 0.01% of pageviews getting the survey.

Enwikivoyage sees a lot less traffic, about 157000 pageviews per day. There I'm targeting roughly 50 survey impressions per day, which works out to 0.03% of pageviews.

For frwiki we're getting 26 million pageviews per day, I'm targeting 1000 survey impressions per day, which is 0.004% of pageviews.

Finally on ruwiki we're getting 31 million pageviews per day, I'm targeting 1500 survey impressions per day, which is 0.005% of pageviews.

In terms of sampling ratios, based on the amount of hits NavigationTiming currently gets, that gives us:

'wgNavigationTimingSurveySamplingFactor' => [
	// Sub-factor of the above; the lower the value, the more often the survey
	// will be displayed to users
	'default' => 0,
	'cawiki' => 3,
	'enwikivoyage' => 2,
	'frwiki' => 15,
	'ruwiki' => 10,
],
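As a sanity check, the targeted impression rates quoted above can be reproduced from the pageview figures. A quick sketch using the approximate numbers stated in this comment (note that the actual mapping from the sampling factor to a display rate also depends on NavigationTiming's own sampling rate, which isn't shown here):

# Back-of-the-envelope check of the targeted sampling percentages quoted above.
targets = {
    "cawiki":       (100,  1_000_000),   # (target impressions/day, pageviews/day)
    "enwikivoyage": (50,     157_000),
    "frwiki":       (1000, 26_000_000),
    "ruwiki":       (1500, 31_000_000),
}
for wiki, (impressions, pageviews) in targets.items():
    print(f"{wiki}: {impressions / pageviews:.3%} of pageviews get the survey")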

Change 434641 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Launch performance survey on cawiki and enwikivoyage

https://gerrit.wikimedia.org/r/434641

Krinkle updated the task description. Wed, May 23, 5:06 PM

[Perhaps a note about this survey could be added to Tech News; it concerns only a few wikis, but in theory they represent the whole farm.]

Change 434641 merged by jenkins-bot:
[operations/mediawiki-config@master] Launch performance survey on cawiki and enwikivoyage

https://gerrit.wikimedia.org/r/434641

Mentioned in SAL (#wikimedia-operations) [2018-05-24T08:32:21Z] <gilles@tin> Synchronized wmf-config/InitialiseSettings.php: T187299 Launch performance survey on cawiki and enwikivoyage (duration: 01m 08s)

We can check survey impressions with this (whether people respond or not): https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1&refresh=1m&from=now-24h&to=now&var-schema=QuickSurveyInitiation

The first 3 impressions today were me testing.

We have gotten a trickle of responses since the survey started a bit less than 24 hours ago. About 2% of survey impressions get a response, which is a much lower response ratio than I expected. I'm going to increase the rate for cawiki and enwikivoyage while I enable the survey for frwiki shortly.

Anecdotally, in the 6 responses we've gotten, we have our first "no" from an anonymous reader, with a 4g effective connection type, a firstPaint of 1229ms and a loadEventEnd of 1431ms on this tiny article: https://ca.wikipedia.org/wiki/Roca_volc%C3%A0nica

Change 435123 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Enable performance survey on frwiki

https://gerrit.wikimedia.org/r/435123

Change 435123 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable performance survey on frwiki

https://gerrit.wikimedia.org/r/435123

Mentioned in SAL (#wikimedia-operations) [2018-05-25T08:26:18Z] <gilles@tin> Synchronized wmf-config/InitialiseSettings.php: T187299 Launch performance survey on frwiki (duration: 01m 22s)

We're starting to have a bit of data, with the survey running on frwiki over the weekend.

SELECT COUNT(*), AVG(event_firstPaint), MIN(event_firstPaint), MAX(event_firstPaint), event_surveyResponseValue FROM QuickSurveysResponses_15266417 INNER JOIN NavigationTiming_17975587 ON event_surveyInstanceToken = event_stickyRandomSessionId WHERE event_firstPaint IS NOT NULL AND event_surveyCodeName = 'perceived-performance-survey' GROUP BY event_surveyResponseValue

COUNT	AVG	MIN	MAX	event_surveyResponseValue
5	1952.60	611	3912	ext-quicksurveys-example-internal-survey-answer-negative
7	1119.43	432	1960	ext-quicksurveys-example-internal-survey-answer-neutral
70	1181.77	192	7863	ext-quicksurveys-example-internal-survey-answer-positive
Gilles added a comment (edited). Mon, May 28, 9:03 AM

All of the following is premature given that we only have 100-ish data points, but I'm already exploring what we can look at, and we'll re-run the numbers when we have more data. I'm not looking at wikis separately yet.

Having a quick look at the data, the absolute Pearson correlation coefficient for numerical NavTiming metrics is highest for firstPaint, with a modest value of -0.17 (the coefficient ranges from -1 to 1). This does confirm that the higher the firstPaint, the higher the likelihood that the user will consider that the page doesn't load fast enough. In comparison, fetchStart is a terrible predictor of user sentiment, with a coefficient of 0.039 (it should be negative!). For comparison's sake, it's about as well correlated as the pageId!

Notably, RumSpeedIndex doesn't fare well either, at -0.086.

Non-numerical types, like effective connection type, need to be converted to a numerical value first. Doing a dumb conversion of: slow-2g => 1, 2g => 2, 3g => 3, 4g => 4 gives a correlation coefficient of 0.097.
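For reproducibility, here is a minimal sketch of how such coefficients could be computed, assuming a hypothetical CSV export of the joined data (file and column names are placeholders, and neutral answers are dropped for simplicity, which may differ from the exact treatment above):

# Sketch of the Pearson correlation computation described above.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("joined_rum_and_survey.csv")  # hypothetical export
df = df[df["surveyResponse"] != "neutral"].copy()
# 1 = "the page loaded fast enough", 0 = it didn't.
df["satisfied"] = (df["surveyResponse"] == "positive").astype(int)

# Ordinal encoding of the effective connection type, as described above.
df["ectNumeric"] = df["netinfoEffectiveConnectionType"].map(
    {"slow-2g": 1, "2g": 2, "3g": 3, "4g": 4}
)

for column in ["firstPaint", "fetchStart", "RumSpeedIndex", "ectNumeric"]:
    subset = df[[column, "satisfied"]].dropna()
    r, _ = pearsonr(subset[column], subset["satisfied"])
    print(f"{column}: r = {r:.3f}")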

Looking at yesterday, we're getting about 5100 impressions per day on frwiki (0.02% of all pageviews), which yields about 33 responses (0.6% impression-to-response ratio, which is worse than on the other wikis). I'm going to crank impressions up on frwiki while I enable the survey on ruwiki.

Change 435969 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/mediawiki-config@master] Enable performance survey on ruwiki

https://gerrit.wikimedia.org/r/435969

Change 435969 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable performance survey on ruwiki

https://gerrit.wikimedia.org/r/435969

Mentioned in SAL (#wikimedia-operations) [2018-05-29T07:47:45Z] <gilles@tin> Synchronized wmf-config/InitialiseSettings.php: T187299 Launch performance survey on ruwiki (duration: 01m 50s)

I wonder how many people will mark "no" because the slowest-loading thing on the page is the survey itself - enough that it causes very visible page jumping.

Gilles added a comment (edited). Wed, May 30, 6:24 AM

It's a possibility, but due to caching for logged-out users we can't have the survey injected in page content in any other way than with javascript, which means late loading one way or another.

We could tweak its layout so that it's an overlay on top of the page rather than something that pushes down infobox content, but it makes the survey more obnoxious and it would still appear late, just presented differently. It's unclear that this significant amount of extra work would nudge the ratio of "no" responses, given that we wouldn't get rid of the fact that it's late-loading.

One thing we could explore is keeping the survey as-is, but injecting it server-side in its place for logged-in users (since pages are uncached for logged-in traffic). Then we could see if this change affects the ratio of "no"s for logged-in users. For now we should keep it as-is to have a significant amount of data for logged-in users to be able to do this comparison later. If the ratio changes when the survey is rendered server-side in a statistically significant manner, we might be able to extrapolate the effect it would have had on logged-out users, were we able to inject the survey server-side. This is still a significant undertaking, as currently the sampling logic is piggy-backing on NavigationTiming, which makes this decision in JS about whether to show the survey or not for that particular pageload.

Gilles added a comment.Mon, Jun 4, 3:44 PM

Looking at the overall data on hive now:

0: jdbc:hive2://analytics1003.eqiad.wmnet:100> set hive.auto.convert.join.noconditionaltask=false;
0: jdbc:hive2://analytics1003.eqiad.wmnet:100> SELECT COUNT(*) AS count, ROUND(PERCENTILE(n.event.firstPaint, 0.5)) AS median, ROUND(PERCENTILE(n.event.firstPaint, 0.9)) AS p90, ROUND(PERCENTILE(n.event.firstPaint, 0.95)) AS p95, q.event.surveyResponseValue FROM event.quicksurveysresponses q INNER JOIN event.navigationtiming n ON q.event.surveyInstanceToken = n.event.stickyRandomSessionId WHERE q.year = 2018 AND n.year = 2018 AND q.event.surveyCodeName = "perceived-performance-survey" AND n.event.firstPaint IS NOT NULL GROUP BY q.event.surveyResponseValue;

count	median	p90	p95	surveyresponsevalue
168	1172.0	6139.0	11998.0	ext-quicksurveys-example-internal-survey-answer-negative
172	1124.0	3396.0	6975.0	ext-quicksurveys-example-internal-survey-answer-neutral
1818	851.0	2424.0	3567.0	ext-quicksurveys-example-internal-survey-answer-positive

I think this already brings a reassuring, obvious - but up to this point unverified - confirmation that on average, the smaller the firstPaint, the more likely users are to consider that the page loaded fast enough.

Now, looking at the per-wiki breakdown of these figures, the results on French Wikipedia are the exception to that trend, with a strangely lower median for the negative response (based on only 16 responses, however):

count	median	p90	p95	q.wiki	surveyresponsevalue
17	1614.0	3772.0	6270.0	cawiki	negative
14	895.0	3229.0	5183.0	cawiki	neutral
116	849.0	2352.0	2698.0	cawiki	positive

2	2116.0	2310.0	2335.0	enwikivoyage	negative
3	1544.0	1702.0	1721.0	enwikivoyage	neutral
15	801.0	1998.0	2732.0	enwikivoyage	positive

16	679.0	3104.0	4872.0	frwiki	negative
24	1086.0	2705.0	4501.0	frwiki	neutral
230	800.0	2464.0	3528.0	frwiki	positive

133	1145.0	6901.0	12158.0	ruwiki	negative
131	1135.0	3416.0	7121.0	ruwiki	neutral
1457	854.0	2419.0	3642.0	ruwiki	positive

We'll have to revisit this once there are more negative responses with firstPaint attached on French Wikipedia.

Gilles added a comment.Mon, Jun 4, 3:57 PM

Looking at the breakdown per "effective connection type", excluding frwiki for now:

SELECT COUNT(*) AS COUNT, ROUND(PERCENTILE(n.event.firstPaint, 0.5)) AS MEDIAN, ROUND(PERCENTILE(n.event.firstPaint, 0.9)) AS p90, ROUND(PERCENTILE(n.event.firstPaint, 0.95)) AS p95, n.event.netinfoEffectiveConnectionType, q.event.surveyResponseValue FROM event.quicksurveysresponses q INNER JOIN event.navigationtiming n ON q.event.surveyInstanceToken = n.event.stickyRandomSessionId WHERE q.year = 2018 AND n.year = 2018 AND q.event.surveyCodeName = "perceived-performance-survey" AND n.event.firstPaint IS NOT NULL AND q.wiki != "frwiki" GROUP BY n.event.netinfoEffectiveConnectionType, q.event.surveyResponseValue;

count	median	p90	p95	netinfoeffectiveconnectiontype	surveyresponsevalue
24	1283.0	4372.0	4833.0	NULL	negative
21	1216.0	8010.0	8468.0	NULL	neutral
148	1061.0	2696.0	4149.0	NULL	positive

2	6235.0	6427.0	6451.0	2g	negative
3	2417.0	2896.0	2956.0	2g	positive

16	3482.0	19523.0	33735.0	3g	negative
16	1784.0	28497.0	58920.0	3g	neutral
133	1717.0	5691.0	9313.0	3g	positive

107	1017.0	3634.0	8124.0	4g	negative
111	991.0	2604.0	4053.0	4g	neutral
1297	804.0	2055.0	2854.0	4g	positive

3	18008.0	37206.0	39605.0	slow-2g	negative
7	4589.0	23088.0	24361.0	slow-2g	positive
SMcCandlish updated the task description. Mon, Jun 4, 11:07 PM
Gilles added a comment.Tue, Jun 5, 9:52 AM

Looking into a suggestion made by @Peter

Ignoring neutral (not sure) responses and frwiki, let's look at the percentage of yes responses (i.e. the user considered that the page loaded fast enough) using different firstPaint thresholds:

firstPaint <= 300ms: 95%
300ms < firstPaint <= 400ms: 93.3%
400ms < firstPaint <= 500ms: 91%
500ms < firstPaint <= 600ms: 95.2%
600ms < firstPaint <= 700ms: 95.5%
700ms < firstPaint <= 800ms: 93.9%
800ms < firstPaint <= 900ms: 93.4%
900ms < firstPaint <= 1s: 93%
1s < firstPaint <= 1.1s: 90.3%
1.1s < firstPaint <= 1.25s: 90.5%
1.25s < firstPaint <= 1.5s: 90.8%
1.5s < firstPaint <= 2s: 91.9%
2s < firstPaint <= 3s: 88.9%
3s < firstPaint <= 4s: 85.7%
4s < firstPaint <= 5s: 78.5% (28 total samples in that bucket, take with a grain of salt)
5s < firstPaint: 68.7%

Example query used to calculate this data:

SELECT COUNT(*) AS COUNT, q.event.surveyResponseValue FROM event.quicksurveysresponses q INNER JOIN event.navigationtiming n ON q.event.surveyInstanceToken = n.event.stickyRandomSessionId WHERE q.year = 2018 AND n.year = 2018 AND q.event.surveyCodeName = "perceived-performance-survey" AND n.event.firstPaint IS NOT NULL AND q.wiki != "frwiki" AND n.event.firstPaint > 4000 AND n.event.firstPaint <= 5000 GROUP BY q.event.surveyResponseValue;
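Rather than running one query per bucket, the same breakdown could be computed in one pass; a minimal sketch assuming a hypothetical export of the joined data (file and column names are placeholders):

# Sketch: percentage of "yes" responses per firstPaint bucket, computed in one pass.
import pandas as pd

df = pd.read_csv("joined_rum_and_survey.csv")  # hypothetical export
df = df[(df["wiki"] != "frwiki") & (df["surveyResponse"] != "neutral")].copy()

edges = [0, 300, 400, 500, 600, 700, 800, 900, 1000,
         1100, 1250, 1500, 2000, 3000, 4000, 5000, float("inf")]
df["bucket"] = pd.cut(df["firstPaint"], bins=edges)

grouped = df.groupby("bucket")["surveyResponse"]
summary = pd.DataFrame({
    "total": grouped.count(),
    "yes_ratio": grouped.apply(lambda s: (s == "positive").mean()),
})
print(summary)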

So far all queries were done without differentiating the mobile and desktop versions of the wiki. For a given slice of firstPaint values, let's see if opinions differ between the mobile and desktop site:

SELECT COUNT(*) AS COUNT, n.event.mobileMode, q.event.surveyResponseValue FROM event.quicksurveysresponses q INNER JOIN event.navigationtiming n ON q.event.surveyInstanceToken = n.event.stickyRandomSessionId WHERE q.year = 2018 AND n.year = 2018 AND q.event.surveyCodeName = "perceived-performance-survey" AND n.event.firstPaint IS NOT NULL AND q.wiki != "frwiki" AND n.event.firstPaint > 500 AND n.event.firstPaint <= 1000 GROUP BY n.event.mobileMode, q.event.surveyResponseValue;

count	mobilemode	surveyresponsevalue
17	NULL	ext-quicksurveys-example-internal-survey-answer-negative
26	NULL	ext-quicksurveys-example-internal-survey-answer-neutral
380	NULL	ext-quicksurveys-example-internal-survey-answer-positive

1	beta	ext-quicksurveys-example-internal-survey-answer-positive

21	stable	ext-quicksurveys-example-internal-survey-answer-negative
25	stable	ext-quicksurveys-example-internal-survey-answer-neutral
270	stable	ext-quicksurveys-example-internal-survey-answer-positive

With firstPaint between 500ms and 1s, desktop users have a satisfaction of 95.7%. In the same firstPaint range, mobile users are 92.7% satisfied.
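For clarity, these satisfaction percentages appear to be positive responses over positive plus negative, with neutral answers excluded (and the single "beta" response ignored); a quick check of the arithmetic from the counts above:

# Satisfaction ratio: positive / (positive + negative), neutral excluded.
desktop_positive, desktop_negative = 380, 17   # mobileMode NULL
mobile_positive, mobile_negative = 270, 21     # mobileMode "stable"
print(f"desktop: {desktop_positive / (desktop_positive + desktop_negative):.2%}")
print(f"mobile:  {mobile_positive / (mobile_positive + mobile_negative):.2%}")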

Given the ranges we've been looking at so far, it's a significant difference. This might suggest that firstPaint correlates less well with perceived performance on mobile. It could be that on less powerful devices the page can still take time to construct after first paint, or simply that getting the beginning of the content quickly doesn't guarantee network stability for the rest of the request on mobile. I think it would be interesting to contrast firstPaint with metrics at the other end of the pageload lifecycle, like loadEventEnd.

Gilles added a comment.Tue, Jun 5, 1:10 PM

This seems to be confirmed by looking at the median loadEventEnd for the firstPaint slice investigated in my previous comment. For negative responses on the desktop site, the loadEventEnd median is 1023. For the mobile site it's 1318. For positive responses it's 934 and 1059, respectively.

This suggests that the span of time between firstPaint and the pageload being completely done sees wider variations on mobile than on desktop, which explains why firstPaint is a worse predictor of user performance perception on mobile. It also confirms that the perception of the page being fast enough takes into account things that happen after firstPaint.

Gilles added a comment.Tue, Jun 5, 2:30 PM

Based on these findings - an early firstPaint doesn't guarantee a fast pageload past that point, particularly on mobile - we can look at loadEventEnd buckets the same way we looked at firstPaint buckets. In that case, a fast loadEventEnd does guarantee a fast firstPaint (when firstPaint is measurable, which depends on the client).

Sample query:

SELECT COUNT(*) AS COUNT, q.event.surveyResponseValue FROM event.quicksurveysresponses q INNER JOIN event.navigationtiming n ON q.event.surveyInstanceToken = n.event.stickyRandomSessionId WHERE q.year = 2018 AND n.year = 2018 AND q.event.surveyCodeName = "perceived-performance-survey" AND n.event.loadEventEnd IS NOT NULL AND q.wiki != "frwiki" AND n.event.loadEventEnd > 500 AND n.event.loadEventEnd <= 1000 GROUP BY q.event.surveyResponseValue;

Results (percentage of yes responses):

loadEventEnd <= 300ms: 97.4%
300ms < loadEventEnd <= 400ms: 93.7%
400ms < loadEventEnd <= 500ms: 93.6%
500ms < loadEventEnd <= 600ms: 93.9%
600ms < loadEventEnd <= 700ms: 92%
700ms < loadEventEnd <= 800ms: 94.5%
800ms < loadEventEnd <= 900ms: 95.3%
900ms < loadEventEnd <= 1s: 96%
1s < loadEventEnd <= 1.1s: 95.7%
1.1s < loadEventEnd <= 1.25s: 93%
1.25s < loadEventEnd <= 1.5s: 90.8%
1.5s < loadEventEnd <= 2s: 91.7%
2s < loadEventEnd <= 3s: 90.6%
3s < loadEventEnd <= 4s: 91.6%
4s < loadEventEnd <= 5s: 83.6%
5s < loadEventEnd: 73.3%

The figures aren't very different from the firstPaint ones, apart from the pronounced peak in the middle. This might suggest that the users who voted no did so for reasons that the individual RUM metrics we currently collect can't capture, or possibly that they were influenced by the late loading of the survey itself and/or by a CentralNotice banner. It will be interesting to revisit these figures once T195840: Track when a CentralNotice banner was displayed to the user in NavTiming and T196163: Add ability to render/inject QuickSurveys server-side to loggedin users have been completed, with a way to filter out situations where dynamic content loaded late in the initial pageload.

Gilles added a comment.Fri, Jun 8, 8:06 PM

Having discovered the complexity of the survey insertion logic, I'm going to take a manual peek at the kinds of articles where people answered "no", to see if there's any pattern. Reading the insertion code, it seems that the intent is for the survey to insert itself in more or less the same area regardless of the article content. However, the visibility code accounts for the possibility of the survey not being visible initially (below the fold), so there might be some interesting edge cases.

Looking at the last 20 articles where people clicked "no", the survey always appeared at the expected spot at the top right of the article. However, in one case it was quite low because of a long first paragraph and could have been below the fold on mobile:

https://ru.m.wikipedia.org/wiki/%D0%9E%D0%B4%D0%B0%D1%80%D1%91%D0%BD%D0%BD%D0%BE%D1%81%D1%82%D1%8C?quicksurvey=internal-survey-perceived-performance-survey

And in another case it was just in the wrong spot, after thumbnails:

https://fr.m.wikipedia.org/wiki/Fossette?quicksurvey=internal-survey-perceived-performance-survey

I did find one bug, where the way the survey is loaded by NavigationTiming means that it can appear on namespaces where it shouldn't, like category pages. Filed as T196772: Performance survey shouldn't appear on category pages

As established in T196163: Add ability to render/inject QuickSurveys server-side to loggedin users with QuickSurveys we can't escape the late loading of the survey. But thanks to the impressions recording, we could have an idea of how long the survey took to show up. Currently the timestamp isn't fine grained enough, but we could record a performance timing, and compare it to what NavigationTiming has recorded. Filed as T196775: Record monotonic time of survey impression