
Analyze results of enwiki and dewiki page previews a/b test
Closed, Resolved (Public)

Authored By
ovasileva
Dec 7 2017, 12:42 PM

Description

Background

Prior to deployment, we would like to complete the analysis of the page previews A/B test.

Acceptance criteria

The main results have also been summarized at https://www.mediawiki.org/wiki/Page_Previews/2017-18_A/B_Tests .


Event Timeline

ovasileva triaged this task as Medium priority. (Dec 7 2017, 12:42 PM)
ovasileva raised the priority of this task from Medium to High. (Dec 7 2017, 12:42 PM)

The daily percentage of pageviews where the new feature was used (i.e. at least one preview card was seen for >= 1000 ms) was between 12% and 13% on both wikis:

Percentage of pageviews where at least one preview card was seen at least 1s (dewiki and enwiki Oct-Nov 2017).png (494×554 px, 38 KB)

Data from

SELECT 
year, month, day, CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS date, 
wiki,
SUM( IF(cards_1000viewed_on_page > 0, 1, 0) ) / SUM(1) AS pc_cards_were_viewed,
SUM(1) AS pageviews
FROM (
  SELECT wiki,
  year, month, day,
  event.pageToken,
  COUNT(DISTINCT event.linkInteractionToken) AS cards_1000viewed_on_page
  FROM tbayer.Popups
  WHERE  
  (event.action = 'pageLoaded'
  OR (CAST(event.totalInteractionTime AS BIGINT) >= CAST(event.perceivedWait AS BIGINT) + 1000 ))
  AND wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1
  AND event.popupEnabled = 1
  AND year = 2017
  GROUP BY wiki, 
    year, month, day,
    event.popupEnabled, event.pageToken) AS pageviewlist
GROUP BY year, month, day, wiki
ORDER BY year, month, day, wiki LIMIT 10000

As expected, pageviews per session (i.e. "session depth") were somewhat lower with previews enabled:

  • dewiki: 2.05 for previews OFF vs. 2.00 for previews ON (a 2.6% drop)
  • enwiki: 2.03 for previews OFF vs. 1.95 for previews ON (a 3.7% drop)

Considering the sample sizes and a separate comparison of the daily version of this session depth (to be posted later), it is now clear that this drop is not due to random variation.

Data via

SELECT wiki,
 event.popupEnabled,
 COUNT(*) AS pageloaded_events,
 COUNT(DISTINCT event.pageToken) AS pageTokens,
 COUNT(DISTINCT event.sessionToken) AS sessions,
 COUNT(DISTINCT event.pageToken)/COUNT(DISTINCT event.sessionToken) AS page_tokens_per_session
 FROM tbayer.Popups
 WHERE 
  wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1
  AND year = 2017
  AND ( (month = 10 AND day >= 23) OR (month = 11 AND day <= 12) )
 AND event.action = 'pageLoaded'
 GROUP BY wiki, event.popupEnabled
 ORDER BY wiki, event.popupEnabled  LIMIT 10000
| wiki | event.popupEnabled | pageloaded_events | pageTokens | sessions | page_tokens_per_session |
| --- | --- | --- | --- | --- | --- |
| dewiki | False | 6451916 | 6441016 | 3144231 | 2.048519 |
| dewiki | True | 6290223 | 6279593 | 3147173 | 1.995312 |
| enwiki | False | 21510470 | 21469097 | 10591697 | 2.026974 |
| enwiki | True | 20692641 | 20652841 | 10582654 | 1.951575 |

The average number of cards viewed per pageview was around 0.26 on both wikis.

Data via

SELECT wiki,
COUNT(DISTINCT event.pageToken) AS pageTokens,
COUNT(DISTINCT event.sessionToken) AS sessions,
COUNT(DISTINCT event.pageToken)/COUNT(DISTINCT event.sessionToken) AS page_tokens_per_session,
COUNT(DISTINCT event.linkInteractionToken) AS cards_1000viewed,
COUNT(DISTINCT event.linkInteractionToken)/COUNT(DISTINCT event.pageToken) AS cards_1000viewed_per_pageview,
COUNT(DISTINCT event.linkInteractionToken)/COUNT(DISTINCT event.sessionToken) AS cards_1000viewed_per_session
FROM tbayer.Popups
WHERE 
  wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1
  AND event.popupEnabled = 1
  AND year = 2017
  AND ( (month = 10 AND day >= 23) OR (month = 11 AND day <= 12) )  
  AND (event.action = 'pageLoaded'
    OR (CAST(event.totalInteractionTime AS BIGINT) >= CAST(event.perceivedWait AS BIGINT) + 1000 ))
 GROUP BY wiki
 ORDER BY wiki  LIMIT 10000
| wiki | pageTokens | sessions | page_tokens_per_session | cards_1000viewed | cards_1000viewed_per_pageview | cards_1000viewed_per_session |
| --- | --- | --- | --- | --- | --- | --- |
| dewiki | 6279860 | 3147296 | 1.9953191565076815 | 1639537 | 0.2610785909240015 | 0.5209351138246927 |
| enwiki | 20654102 | 10583271 | 1.9515801872596856 | 5327075 | 0.25791849967623864 | 0.503348633895891 |

(edited to correct, after accidentally leaving out the limitation to the test group in the query)

Disable rates:
The feature was disabled in around 0.01% of sessions (on both wikis).

Data from:

SELECT wiki,
event.popupEnabled,
SUM(IF(event.action = 'disabled',1,0)) AS disables,
COUNT(DISTINCT event.sessionToken) AS sessions,
SUM(IF(event.action = 'disabled',1,0))/COUNT(DISTINCT event.sessionToken) AS disables_per_session
FROM tbayer.Popups
WHERE 
  wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1
  AND year = 2017
  AND ( (month = 10 AND day >= 23) OR (month = 11 AND day <= 12) )
 GROUP BY wiki, event.popupEnabled
 ORDER BY wiki, event.popupEnabled  LIMIT 10000
| wiki | event.popupEnabled | disables | sessions | disables_per_session |
| --- | --- | --- | --- | --- |
| dewiki | False | 323 | 3145404 | 0.000103 |
| dewiki | True | 0 | 3148189 | 0.000000 |
| enwiki | False | 957 | 10596732 | 0.000090 |
| enwiki | True | 0 | 10587140 | 0.000000 |

(Note that when an anonymous user in a session with previews enabled chooses to disable the feature, the resulting event already has popupEnabled set to false. See also T167365#3469145 and the preceding comments there.)

(Leaving this here for want of a better place:)

I generated histograms of the link interaction timing by action type again. As noted at the time of the previous A/B tests, we didn't quite have enough data back then for an entirely smooth plot (the graphs looked choppy because of random variation). This is no longer a concern with the last two experiments (Oct/Nov and Dec/Jan). However, there is a kind of comb structure visible now, which seems to point to some kind of rounding issue, perhaps only for some browsers. I may re-generate this with a different bucket size once I have a better sense of how the timings the browsers send are quantized.

Frequency of actions per type over time (previews enabled, enwiki, Dec 21, 2017 - Jan 31, 2018) 0-2s vertical zoom.png (546×724 px, 62 KB)

Frequency of actions per type over time (previews enabled, enwiki, Dec 21, 2017 - Jan 31, 2018) 0-2s.png (546×707 px, 48 KB)

Data via

SELECT bucket, action, COUNT(*) AS frequency FROM (
  SELECT FLOOR(CAST(event.totalInteractionTime AS BIGINT)/10) AS bucket, 
  event.action AS action
  FROM event.popups
  WHERE wiki ='enwiki' -- TODO: redo this for dewiki
  AND event.isAnon = 1
  AND ((month = 12 AND day >= 21) OR (month = 1) )
  AND event.popupEnabled = 1
  AND event.linkInteractionToken IS NOT NULL) AS link_interactions_with_bucket
GROUP BY bucket, action
ORDER BY bucket, action LIMIT 10000

...

> I generated histograms of the link interaction timing by action type again. [...] However, there is a kind of comb structure visible now, which seems to point to some kind of rounding issue, perhaps only for some browsers.

PS: in any case, it doesn't seem to be due to last month's Meltdown/Spectre mitigation changes that reduced the resolution of the performance.now() timer in all major browsers: the histogram from the October/November experiment (still need to post it) looks similarly comb-like.

...

> I generated histograms of the link interaction timing by action type again. [...]
>
> PS: in any case, it doesn't seem to be due to last month's Meltdown/Spectre mitigation changes that reduced the resolution of the performance.now() timer in all major browsers [...]

@Jdlrobson noted the getCurrentTimestamp function in index.js, which indeed uses performance.now(); but it seems it was only added in November for the new timestamp field, and is not used for the totalInteractionTime field that is at issue here.

...

> PS: in any case, it doesn't seem to be due to last month's Meltdown/Spectre mitigation changes [...] The histogram from the October/November experiment (still need to post it) looks similarly comb-like.

Here it is; the pattern is the same, with a periodicity of 5:

Frequency of actions per type over time (previews enabled, enwiki, Oct 23-Nov 12 2017) 0-2s vertical zoom.png (546×724 px, 59 KB)

Since it looks like that weird comb structure has a periodicity of 5 (hundredths of a second), I took a look at the distribution mod 50 ms, in the original resolution of 1 ms (and limited to >= 500 ms so as not to distort the result with the huge spike below that):

Frequency of actions per type over time mod 50ms, limited to >=500ms (previews enabled, enwiki, Oct 23-Nov 12 2017).png (537×711 px, 61 KB)

This is all a bit strange, in particular considering that the opened event doesn't seem to exhibit the same pattern (only dwelledButAbandoned and dismissed do), and that 50 is not divisible by 3.

Data via:

SELECT bucket, action, COUNT(*) AS frequency FROM (
  SELECT PMOD(CAST(event.totalInteractionTime AS BIGINT),50) AS bucket, 
  event.action AS action
  FROM tbayer.popups
  WHERE wiki ='enwiki' 
  AND event.isAnon = 1
  AND year = 2017
  AND ( (month = 10 AND day >= 23) OR (month = 11 AND day <= 12) )
  AND event.popupEnabled = 1
  AND event.linkInteractionToken IS NOT NULL
  AND CAST(event.totalInteractionTime AS BIGINT) >= 500 ) AS link_interactions_with_bucket
GROUP BY bucket, action
ORDER BY bucket, action LIMIT 10000

And to be clear, the concern here is not about precision per se - for the purposes of evaluating the A/B test, and for generating this histogram, a resolution of 50 ms would be quite OK. Rather, the concern is that the cause of the comb pattern could be a bug, like (wild paranoid imagination:) some user agents logging a multiple of the true value.

Tbayer added a subscriber: phuedx.

As discussed in kickoff today, reassigning this temporarily to @phuedx for some engineering attention on the above timing issues (rather than breaking them out into a new task).

Basically, we want to understand what causes these comb-like anomalies in the histogram, at least well enough that we can conclude they are unlikely to come from a bug that may affect the overall metrics we care about per the task description. I think a first step could be to review the instrumentation code to find out which kind of JavaScript-accessible timer the totalInteractionTime field is based on, and then check whether there is documentation from browser vendors about its true resolution and accuracy.

Here is yet another view which may be the most instructive way to look at the issue - basically a magnifying glass aimed at a part of the graph from T182314#3949885 (the part between 1.00 and 1.20 seconds), likewise in the original millisecond resolution instead of the 1/100 second buckets. Looks like the periodicity may indeed be close to 16⅔ ms: 12 repetitions in this 200 ms timespan, i.e. 200/12 ≈ 16.7 ms (= 1000/60 ms).

Frequency of actions per type over time (previews enabled, enwiki, Oct 23-Nov 12 2017) 1000-1200 ms.png (555×733 px, 54 KB)

Data via

SELECT bucket, action, COUNT(*) AS frequency FROM (
  SELECT CAST(event.totalInteractionTime AS BIGINT) AS bucket, 
  event.action AS action
  FROM tbayer.popups
  WHERE wiki ='enwiki' 
  AND event.isAnon = 1
  AND year = 2017
  AND ( (month = 10 AND day >= 23) OR (month = 11 AND day <= 12) )
  AND event.popupEnabled = 1
  AND event.linkInteractionToken IS NOT NULL) AS link_interactions_with_bucket
GROUP BY bucket, action
ORDER BY bucket, action LIMIT 10000

I'm inclined to believe that this is not a bug but an artefact of the resolution of JS timers, which apparently varies by browser and OS.


The Page Previews EventLogging instrumentation (the instrumentation) maps a stream of UI and JS timer events to a stream of Popups EventLogging events (Popups events). During the mapping process, the timestamp of when the original event occurred is recorded. The timestamp is computed by invoking window.mw.now, which on all of our Grade A browsers maps to window.performance.now. Note well that this timestamp doesn't correspond to the timestamp property of Popups events – this isn't the issue here but we might need to revisit it. Now, when generating either a dismissed or dwelledButAbandoned Popups event, we subtract the timestamp of the event that started the current link interaction from that of the current event and round the result to the nearest integer. We do the same when generating opened events elsewhere (but within the same context).

So, we're using high-performance timers, which have a sub-millisecond resolution (even after the Spectre and Meltdown mitigation measures were released) and the totalInteractionTime property calculation looks correct (and is unit tested!).

As I said, the instrumentation maps a stream of UI and JS timer events to Popups events. The reason for the timer events is the way in which we detect that a user has intentionally moved away from a link or a preview and isn't just moving their cursor from one to the other: when the user abandons a link, we start a JS timer via window.setTimeout, and when it fires, if the user has finished the link interaction, we map it to a dismissed or dwelledButAbandoned Popups event. There's now another timer at play: the internal timer of the browser, which affects when JS timer callbacks are scheduled!

John Resig published a detailed writeup of how JS timers work, which I won't duplicate. What's important here is that JS timer callbacks aren't guaranteed to be executed at exactly the time you specify (the distribution of actual execution times around the specified time is roughly normal), and that the scheduling of JS timer callbacks depends on the minimum timer resolution of the OS.

The default timer resolution on Windows is 15 ms – it was quite hard to pin down a source that isn't a .docx for this; also, some sources quote 15.6 ms and others quote 16 ms (15.625 ms would correspond to the default 64 Hz system tick, 1000/64) – which is remarkably close to the magic number in T182314#3954765. Chrom{e,ium} used an OS timer resolution of ~2 ms on Windows for a while, but that increased power consumption so dramatically that it was changed back to the default. Anecdotally, [[ https://www.quora.com/How-accurate-are-JavaScript-timers#__w2_Ntw2CfQ_answer | Ryan Grove observed that the resolution of window.setTimeout varies by browser ]] but didn't provide details of the OSes that he was testing on.

Stepping back to look at the opened Popups event case: this event is mapped from a UI event (a click event) and not a JS timer event.

I believe that the comb-like structure of the histograms that @Tbayer has produced can be explained thusly: modern browsers running on modern OSes with browser timer resolutions of ~4 ms – e.g. Chrome (64.0.3282.140) on macOS High Sierra (10.13.1) – produce a smooth distribution of totalInteractionTime values for dismissed and dwelledButAbandoned Popups events and those browsers that have higher browser timer resolutions (circa 16 ms) are producing values distributed normally around multiples of that timer precision.
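
To make the mechanics concrete, here is a minimal sketch of the measurement path described above (TypeScript; the names, the delay value, and the logging helper are hypothetical, not the actual extension code):

// Minimal sketch (hypothetical names) of the timing path described above.
// The start timestamp comes straight from a UI event, so it has
// performance.now()'s fine resolution; the end timestamp is only taken
// when a setTimeout callback fires, so it inherits the scheduler's tick.

const ABANDON_END_DELAY = 300; // ms; placeholder value

let interactionStarted = 0;

function logEvent(event: { action: string; totalInteractionTime: number }): void {
  console.log(JSON.stringify(event)); // stand-in for EventLogging
}

function onLinkDwell(): void {
  interactionStarted = performance.now(); // fine-grained start timestamp
}

function onLinkAbandon(): void {
  setTimeout(() => {
    // The end timestamp is quantized by when the browser actually runs
    // this callback (~4 ms or ~16 ms ticks, depending on browser/OS).
    const interactionFinished = performance.now();
    logEvent({
      action: 'dwelledButAbandoned',
      totalInteractionTime: Math.round(interactionFinished - interactionStarted),
    });
  }, ABANDON_END_DELAY);
}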

> I'm inclined to believe that this is not a bug but an artefact of the resolution of JS timers, which apparently varies by browser and OS.

Excellent, thanks for the investigation and writeup! This is super informative and should indeed give us enough confidence to proceed under that assumption. I will redo the histograms with a different bucket size to avoid or reduce these Moiré artifacts, so that we can discern the distributions of the different actions more clearly. Presumably your findings will also be helpful in the context of the performance.now analysis (T186016).

While they should no longer be a blocker at this point, the explanation still raises a few further open questions about how the timer resolution causes the comb patterns in detail, see below (in particular, I couldn't verify the hypothesis about modern browsers/OSes in the case of Chrome 64 on Mac OS X 10.13).

...

> the scheduling of JS timer callbacks depends on the minimum timer resolution of the OS.

So this yields a beautiful explanation why the end time of a dismissed or dwelledButAbandoned link interaction would be distributed in this fashion (quantized by the OS timer resolution, then smeared out by a normal distribution). But what about the start of the link interaction - is that time measurement limited by the OS timer resolution too? If not, why would their difference (interaction.finished - interaction.started) be quantized that way?

> I believe that the comb-like structure of the histograms that @Tbayer has produced can be explained thusly: modern browsers running on modern OSes with browser timer resolutions of ~4 ms – e.g. Chrome (64.0.3282.140) on macOS High Sierra (10.13.1) – produce a smooth distribution of totalInteractionTime values for dismissed and dwelledButAbandoned Popups events and those browsers that have higher browser timer resolutions (circa 16 ms) are producing values distributed normally around multiples of that timer precision.

Was this meant to read "OS timer resolution" instead of "browser timer resolution"? In any case, I checked this hypothesis in the case of Chrome 64 on Mac OS X 10.13 (the most fine-grained versioning available in the parsed user agent field of EventLogging), and it doesn't quite hold - the histogram shows the ~16 ms comb pattern at least as much as the general one, or the one for Windows:

Frequency of actions per type over time (previews enabled, enwiki, Chrome 64 on Max OS X 10.13, Dec 21 2017-Feb 7, 2018).png (575×733 px, 64 KB)

Frequency of actions per type over time (previews enabled, enwiki, Windows, Dec 21 2017-Feb 7, 2018).png (555×733 px, 55 KB)

Frequency of actions per type over time (previews enabled, enwiki, Chrome on Max OS X, Dec 21 2017-Feb 7, 2018).png (575×733 px, 57 KB)

Data via:

SELECT bucket, action, COUNT(*) AS frequency FROM (
  SELECT CAST(event.totalInteractionTime AS BIGINT) AS bucket, 
  event.action AS action
  FROM event.popups  
  WHERE 
  ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=7) )  
  AND useragent.browser_major = '64'
  AND useragent.os_minor = '13'
  AND useragent.os_major = '10'  
  AND useragent.browser_family = 'Chrome'  
  AND useragent.os_family = 'Mac OS X'
  AND wiki ='enwiki'  
  AND event.popupEnabled = 1
  AND event.isAnon = 1
  AND event.linkInteractionToken IS NOT NULL) AS link_interactions_with_bucket
GROUP BY bucket, action
ORDER BY bucket, action LIMIT 10000;

SELECT bucket, action, COUNT(*) AS frequency FROM (
  SELECT CAST(event.totalInteractionTime AS BIGINT) AS bucket, 
  event.action AS action
  FROM event.popups  
  WHERE wiki ='enwiki'
  AND event.isAnon = 1
  AND ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=7) )
  AND event.popupEnabled = 1
  AND useragent.os_family LIKE 'Windows%'
  AND event.linkInteractionToken IS NOT NULL) AS link_interactions_with_bucket
GROUP BY bucket, action
ORDER BY bucket, action LIMIT 10000;


SELECT bucket, action, COUNT(*) AS frequency FROM (
  SELECT CAST(event.totalInteractionTime AS BIGINT) AS bucket, 
  event.action AS action
  FROM event.popups  
  WHERE wiki ='enwiki'
  AND event.isAnon = 1
  AND ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=7) )
  AND event.popupEnabled = 1
  AND useragent.os_family = 'Mac OS X'
  AND useragent.browser_family = 'Chrome'
  AND event.linkInteractionToken IS NOT NULL) AS link_interactions_with_bucket
GROUP BY bucket, action
ORDER BY bucket, action LIMIT 10000;

> I'm inclined to believe that this is not a bug but an artefact of the resolution of JS timers, which apparently varies by browser and OS.

I agree with @phuedx's hypothesis, it really does look like some artifact of the timers' resolution.

> While they should no longer be a blocker at this point, the explanation still raises a few further open questions about how the timer resolution causes the comb patterns in detail, see below (in particular, I couldn't verify the hypothesis about modern browsers/OSes in the case of Chrome 64 on Mac OS X 10.13).

...

> the scheduling of JS timer callbacks depends on the minimum timer resolution of the OS.

> So this yields a beautiful explanation why the end time of a dismissed or dwelledButAbandoned link interaction would be distributed in this fashion (quantized by the OS timer resolution, then smeared out by a normal distribution). But what about the start of the link interaction - is that time measurement limited by the OS timer resolution too?

started should only be affected by performance.now behavior (very small variations) since there is no setTimeout in between the action starting and the time being tracked (sources: reducers/eventLogging.js#L239 & actions.js#L192)

> If not, why would their difference (interaction.finished - interaction.started) be quantized that way?

If finished follows that special distribution, and started is evenly distributed, doesn't it make sense that the difference also shows the special distribution?

> Was this meant to read "OS timer resolution" instead of "browser timer resolution"?

I believe "browser timer resolution" is correct. The OS may or not have some limitations on their timers, but the browsers add additional limits themselves.

> In any case, I checked this hypothesis in the case of Chrome 64 on Mac OS X 10.13 (the most fine-grained versioning available in the parsed user agent field of EventLogging), and it doesn't quite hold - the histogram shows the ~16 ms comb pattern at least as much as the general one, or the one for Windows

I'm not sure that modern Windows timers are clamped to 15 or 16 ms; that seems like something pre-Windows 8.1 according to the docs. From Windows 8.1 onwards the Windows APIs expose high-accuracy timers (see that same link that @phuedx posted). As such, I would assume modern browsers are leveraging these where possible and applying the same clamping to the timers at the browser level.

I would expect the browser to behave very similarly across OS most of the time. That the distributions are similar across OS kind of makes sense.


This is a very interesting phenomenon. I think we are confident enough in our implementation (and in the fact that this only happens with the events that have setTimeouts involved in the code path) that the theory makes sense, but it is still hard for me to grasp exactly why this is happening.

I graphed ~500 invocations of setTimeout(200), keeping track of the delta, from Chrome 63 on OS X 10.13.3, and I don't see anything too obvious in the distribution (rounding the delta to 1 decimal place):

Screen Shot 2018-02-13 at 1.53.03 PM.png (444×747 px, 25 KB)

I was sort of expecting to see something. Rounding to the millisecond, it kind of shows a normal distribution, but not really:

Screen Shot 2018-02-13 at 2.04.53 PM.png (489×768 px, 21 KB)

I'm not sure if this can help explain the phenomenon, or maybe I would need more data for it to show, or what.

The variation is also a few ms, and not ~16, so there may be many other factors at play that we're not thinking about: overhead of the code paths that are running, how the event loop is scheduling the DOM events and whether there is some sort of timer or clamping in there, and god knows what else.
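
For reference, a minimal harness along the lines of the experiment described above (a hedged reconstruction in TypeScript, not the exact code used; it runs the ~500 invocations sequentially, so it takes about 100 s):

// Fire setTimeout(200) ~500 times and record how late each callback runs.
const DELAY = 200; // ms
const RUNS = 500;
const deltas: number[] = [];

function runOnce(remaining: number): void {
  const scheduled = performance.now();
  setTimeout(() => {
    // delta = actual wait minus requested wait
    deltas.push(performance.now() - scheduled - DELAY);
    if (remaining > 1) {
      runOnce(remaining - 1);
    } else {
      // Bucket to 0.1 ms, as in the first graph above.
      const histogram = new Map<number, number>();
      for (const d of deltas) {
        const bucket = Math.round(d * 10) / 10;
        histogram.set(bucket, (histogram.get(bucket) ?? 0) + 1);
      }
      console.table([...histogram.entries()].sort((a, b) => a[0] - b[0]));
    }
  }, DELAY);
}

runOnce(RUNS);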

It is very hard to reason about this; it could be worthwhile to reach out to the Mozilla or Chrome folks with the data and just ask, to see if we can get proper reasons for it.

Also, if we do reach out and get more info, it would be super interesting to do a blog post about this stuff; it isn't something that you can find online easily.

Thanks @Jhernandez, very interesting! I'll need some time to fully read and digest your findings; but one response inline for now:

> I'm inclined to believe that this is not a bug but an artefact of the resolution of JS timers, which apparently varies by browser and OS.

> I agree with @phuedx's hypothesis, it really does look like some artifact of the timers' resolution.

(For the record, @phuedx and I discussed this a bit yesterday and we both think that some things still don't quite work out with this explanation. That said, it's the most likely possibility at this point.)

> While they should no longer be a blocker at this point, the explanation still raises a few further open questions about how the timer resolution causes the comb patterns in detail, see below (in particular, I couldn't verify the hypothesis about modern browsers/OSes in the case of Chrome 64 on Mac OS X 10.13).

...

> the scheduling of JS timer callbacks depends on the minimum timer resolution of the OS.

> So this yields a beautiful explanation why the end time of a dismissed or dwelledButAbandoned link interaction would be distributed in this fashion (quantized by the OS timer resolution, then smeared out by a normal distribution). But what about the start of the link interaction - is that time measurement limited by the OS timer resolution too?

> started should only be affected by performance.now behavior (very small variations) since there is no setTimeout in between the action starting and the time being tracked (sources: reducers/eventLogging.js#L239 & actions.js#L192)

> If not, why would their difference (interaction.finished - interaction.started) be quantized that way?

> If finished follows that special distribution, and started is evenly distributed, doesn't it make sense that the difference also shows the special distribution?

No, it wouldn't. Think about it this way: if you take a random number (i.e. a random variable) that is, say, evenly distributed modulo 16, and subtract from it a fixed number (as an extreme example of a quantized random variable), the result would still be evenly distributed modulo 16 ;)
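
To illustrate the point with a quick simulation (a sketch with made-up numbers; the 16 ms tick and the duration distribution are assumptions, not our real data): if only the finished timestamp snaps to a ~16 ms grid while started is smooth, the difference comes out flat modulo 16; the comb only survives if both ends sit on the same grid.

// Distribution of (finished - started) mod 16 under two toy models.
const TICK = 16; // ms, assumed scheduler tick
const N = 100000;

function simulate(quantizeStart: boolean): number[] {
  const histogram = new Array<number>(TICK).fill(0);
  for (let i = 0; i < N; i++) {
    let started = Math.random() * 1000;
    if (quantizeStart) {
      started = Math.floor(started / TICK) * TICK; // same grid as finished
    }
    const trueDuration = 500 + Math.random() * 1000;
    // finished is taken when the timer fires: the next tick after the true end
    const finished = Math.ceil((started + trueDuration) / TICK) * TICK;
    histogram[Math.round(finished - started) % TICK]++;
  }
  return histogram;
}

console.log('uniform start:  ', simulate(false)); // roughly flat mod 16
console.log('quantized start:', simulate(true)); // all mass on one residue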

> It is very hard to reason about this; it could be worthwhile to reach out to the Mozilla or Chrome folks with the data and just ask, to see if we can get proper reasons for it.

This. @Jhernandez and I chatted about this briefly this morning (GMT) and we were wondering if there's any other instrumentation that has timestamps and/or calculated durations that you know of. If there are, then, depending on their implementation, we may be able to prove that this isn't isolated to Page Previews.

(Work log: switching back to the main task for now, i.e. the part that's independent of the timing questions ;)
We now also have a sizable amount of data from the new test launched in December (which ran a bit longer than anticipated, but is soon to be deactivated in T185973). I repeated the calculation of the difference in pageviews (per session) from T182314#3896974 for this new data.

In this timespan (specifically, the eight weeks from Dec 21-Feb 14), the pageviews drop was slightly larger on both wikis, with the average number of pageviews per session as follows:

  • dewiki: 2.11 for previews OFF vs. 2.05 for previews ON (a 3.0% drop)
  • enwiki: 2.13 for previews OFF vs. 2.03 for previews ON (a 4.7% drop)

The October/November result from T182314#3896974 for direct comparison:

  • dewiki: 2.05 for previews OFF vs. 2.00 for previews ON (a 2.6% drop)
  • enwiki: 2.03 for previews OFF vs. 1.95 for previews ON (a 3.7% drop)

Session lengths (i.e. the number of pageviews per session) were somewhat higher in the December-February sample. Besides general seasonal changes in reader behavior (in particular around the holidays), this is likely at least partly due to the ever-present truncation effect (caused by browser sessions that started before the beginning of the experiment or ended after it) combined with the longer timespan (eight weeks now vs. three weeks in October/November).

Data via

SELECT wiki,
 event.popupEnabled,
 COUNT(*) AS pageloaded_events,
 COUNT(DISTINCT event.pageToken) AS pageTokens,
 COUNT(DISTINCT event.sessionToken) AS sessions,
 COUNT(DISTINCT event.pageToken)/COUNT(DISTINCT event.sessionToken) AS page_tokens_per_session
 FROM event.popups
 WHERE 
  wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1
  AND ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=14) )
  AND event.action = 'pageLoaded'
 GROUP BY wiki, event.popupEnabled
 ORDER BY wiki, event.popupEnabled  LIMIT 10000
| wiki | event.popupEnabled | pageloaded_events | pageTokens | sessions | page_tokens_per_session |
| --- | --- | --- | --- | --- | --- |
| dewiki | False | 17397497 | 17366481 | 8218222 | 2.113168 |
| dewiki | True | 16866382 | 16835588 | 8212863 | 2.049905 |
| enwiki | False | 56862356 | 56755806 | 26621860 | 2.131925 |
| enwiki | True | 54167302 | 54064464 | 26605511 | 2.032078 |

[edited to add:
By coincidence, both wikis' OFF numbers from the first test match the ON numbers from the second test as given above: 2.05 for dewiki and 2.03 for enwiki. This is not a typo; in fact only the rounded numbers are identical: as can be seen from the detailed result tables posted here, dewiki OFF was at 2.048519 in the first test when calculated with 7-digit precision, compared to 2.049905 for dewiki ON in the second test.]

The disable rates for the second test were about as low as in the first one (T182314#3909766; the precise numbers were in fact somewhat lower in the second test, but I didn't check whether the change is significant):

> The feature was disabled in around 0.01% of sessions (on both wikis).

Data via

SELECT wiki,
event.popupEnabled,
SUM(IF(event.action = 'disabled',1,0)) AS disables,
COUNT(DISTINCT event.sessionToken) AS sessions,
SUM(IF(event.action = 'disabled',1,0))/COUNT(DISTINCT event.sessionToken) AS disables_per_session
FROM event.popups
WHERE 
  wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1
  AND ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=14) )
 GROUP BY wiki, event.popupEnabled
 ORDER BY wiki, event.popupEnabled  LIMIT 10000
| wiki | event.popupEnabled | disables | sessions | disables_per_session |
| --- | --- | --- | --- | --- |
| dewiki | False | 703 | 8220660 | 0.000086 |
| dewiki | True | 0 | 8215162 | 0.000000 |
| enwiki | False | 2047 | 26630149 | 0.000077 |
| enwiki | True | 0 | 26613400 | 0.000000 |

Page interactions (pageviews + seen previews) per session:

First test (Oct 23-Nov 12):

  • dewiki: 2.05 for previews OFF vs. 2.52 for previews ON (a 23% increase)
  • enwiki: 2.03 for previews OFF vs. 2.45 for previews ON (a 21% increase)

Second test (Dec 21-Feb 14):

  • dewiki: 2.11 for previews OFF vs. 2.59 for previews ON (a 23% increase)
  • enwiki: 2.13 for previews OFF vs. 2.57 for previews ON (a 21% increase)

Data via:

First test (Oct 23-Nov 12):

SELECT wiki,
event.popupEnabled AS popupEnabled,
COUNT(DISTINCT event.pageToken) AS pageTokens,
COUNT(DISTINCT event.sessionToken) AS sessions,
COUNT(DISTINCT event.pageToken)/COUNT(DISTINCT event.sessionToken) AS page_tokens_per_session,
COUNT(DISTINCT event.linkInteractionToken) AS cards_1000viewed,
COUNT(DISTINCT event.linkInteractionToken)/COUNT(DISTINCT event.pageToken) AS cards_1000viewed_per_pageview,
COUNT(DISTINCT event.linkInteractionToken)/COUNT(DISTINCT event.sessionToken) AS cards_1000viewed_per_session,
(COUNT(DISTINCT event.linkInteractionToken)+COUNT(DISTINCT event.pageToken))/COUNT(DISTINCT event.sessionToken) AS page_interactions_per_session
FROM tbayer.Popups
WHERE 
  wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1
  AND year = 2017
  AND ( (month = 10 AND day >= 23) OR (month = 11 AND day <= 12) )  
  AND (event.action = 'pageLoaded'
    OR (CAST(event.totalInteractionTime AS BIGINT) >= CAST(event.perceivedWait AS BIGINT) + 1000 ))
 GROUP BY wiki, event.popupEnabled
 ORDER BY wiki, popupEnabled LIMIT 10000
| wiki | popupEnabled | pageTokens | sessions | page_tokens_per_session | cards_1000viewed | cards_1000viewed_per_pageview | cards_1000viewed_per_session | page_interactions_per_session |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dewiki | False | 6441016 | 3144231 | 2.048519 | 20 | 0.000003 | 0.000006 | 2.048525 |
| dewiki | True | 6279860 | 3147296 | 1.995319 | 1639537 | 0.261079 | 0.520935 | 2.516254 |
| enwiki | False | 21469097 | 10591697 | 2.026974 | 68 | 0.000003 | 0.000006 | 2.026981 |
| enwiki | True | 20654102 | 10583271 | 1.951580 | 5327075 | 0.257918 | 0.503349 | 2.454929 |

Second test (Dec 21-Feb 14):

SELECT wiki,
event.popupEnabled AS popupEnabled,
COUNT(DISTINCT event.pageToken) AS pageTokens,
COUNT(DISTINCT event.sessionToken) AS sessions,
COUNT(DISTINCT event.pageToken)/COUNT(DISTINCT event.sessionToken) AS page_tokens_per_session,
COUNT(DISTINCT event.linkInteractionToken) AS cards_1000viewed,
COUNT(DISTINCT event.linkInteractionToken)/COUNT(DISTINCT event.pageToken) AS cards_1000viewed_per_pageview,
COUNT(DISTINCT event.linkInteractionToken)/COUNT(DISTINCT event.sessionToken) AS cards_1000viewed_per_session,
(COUNT(DISTINCT event.linkInteractionToken)+COUNT(DISTINCT event.pageToken))/COUNT(DISTINCT event.sessionToken) AS page_interactions_per_session
FROM event.popups
WHERE 
  wiki IN ('enwiki', 'dewiki')
  AND event.isAnon = 1  
  AND ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=14) )
  AND (event.action = 'pageLoaded'
    OR (CAST(event.totalInteractionTime AS BIGINT) >= CAST(event.perceivedWait AS BIGINT) + 1000 ))
 GROUP BY wiki, event.popupEnabled
 ORDER BY wiki, popupEnabled LIMIT 10000
| wiki | popupEnabled | pageTokens | sessions | page_tokens_per_session | cards_1000viewed | cards_1000viewed_per_pageview | cards_1000viewed_per_session | page_interactions_per_session |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dewiki | False | 17366481 | 8218222 | 2.113168 | 75 | 0.000004 | 0.000009 | 2.113177 |
| dewiki | True | 16836200 | 8213132 | 2.049912 | 4468400 | 0.265404 | 0.544056 | 2.593968 |
| enwiki | False | 56755806 | 26621860 | 2.131925 | 131 | 0.000002 | 0.000005 | 2.131930 |
| enwiki | True | 54067021 | 26606600 | 2.032091 | 14400323 | 0.266342 | 0.541231 | 2.573322 |

NB: As visible from the tables, apparently a very small percentage of clients (less than 0.001% of sessions) sent the wrong value for popupEnabled. This does not affect the accuracy of the result as stated above.

So @ovasileva and I continued to discuss the above page interactions metric (pageviews + seen previews) and eventually settled on a different metric which we think is more pertinent: the number of distinct (unique) pages the reader interacts with, via either pageviews or seen previews, per browser session.
(E.g. a page that is first previewed and then opened in a separate pageview would be counted only once. In the control group - previews OFF - this metric basically coincides with the unique pageviews already used in the context of the back button analysis at T144603#4010992.)

The A/B test results for this number of "unique page interactions", or more descriptively, the number of distinct pages interacted with, as averages per session:

First test (Oct 23-Nov 12):

  • dewiki: 1.71 for previews OFF vs. 2.05 for previews ON (a 20% increase)
  • enwiki: 1.64 for previews OFF vs. 2.00 for previews ON (a 22% increase)

Second test (Dec 21-Feb 14):

  • dewiki: 1.74 for previews OFF vs. 2.10 for previews ON (a 21% increase)
  • enwiki: 1.71 for previews OFF vs. 2.07 for previews ON (a 21% increase)

Data via:

First test (Oct 23-Nov 12):

SELECT wiki, popupEnabled, 
AVG(upageinteractions) AS avg_unique_interactions
FROM (
    SELECT wiki, popupEnabled, 
    sessionToken, COUNT(DISTINCT pagetitle) AS upageinteractions
	FROM (
	    SELECT wiki, event.popupEnabled AS popupEnabled, 
	    event.sessionToken AS sessionToken, 
	    IF(event.action = 'pageLoaded',event.pageTitleSource, REGEXP_REPLACE(event.pageTitleHover, '_', ' ')) AS pagetitle
	    FROM tbayer.Popups
	    WHERE (wiki ='enwiki' OR wiki ='dewiki')
	    AND event.isAnon = 1
	    AND year = 2017
            AND ( (month = 10 AND day >= 23) OR (month = 11 AND day <= 12) )  
	    AND ( event.action = 'pageLoaded'  
	      OR (CAST(event.totalInteractionTime AS BIGINT) >= CAST(event.perceivedWait AS BIGINT) + 1000 ) )
	    ) AS interactionslist
	GROUP BY wiki, popupEnabled,
	sessionToken ) AS sessionslist
GROUP BY wiki, popupEnabled
ORDER BY wiki, popupEnabled LIMIT 1000
| wiki | popupEnabled | avg_unique_interactions |
| --- | --- | --- |
| dewiki | False | 1.706274 |
| dewiki | True | 2.054153 |
| enwiki | False | 1.642089 |
| enwiki | True | 1.995986 |

Second test (Dec 21-Feb 14):

SELECT wiki, popupEnabled, 
AVG(upageinteractions) AS avg_unique_interactions
FROM (
    SELECT wiki, popupEnabled, 
    sessionToken, COUNT(DISTINCT pagetitle) AS upageinteractions
	FROM (
	    SELECT wiki, event.popupEnabled AS popupEnabled, 
	    event.sessionToken AS sessionToken, 
	    IF(event.action = 'pageLoaded',event.pageTitleSource, REGEXP_REPLACE(event.pageTitleHover, '_', ' ')) AS pagetitle
	    FROM event.Popups
	    WHERE (wiki ='enwiki' OR wiki ='dewiki')
	    AND event.isAnon = 1
	    AND ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=14) )
	    AND ( event.action = 'pageLoaded'  
	      OR (CAST(event.totalInteractionTime AS BIGINT) >= CAST(event.perceivedWait AS BIGINT) + 1000 ) )
	    ) AS interactionslist
	GROUP BY wiki, popupEnabled,
	sessionToken ) AS sessionslist
GROUP BY wiki, popupEnabled
ORDER BY wiki, popupEnabled LIMIT 1000
| wiki | popupEnabled | avg_unique_interactions |
| --- | --- | --- |
| dewiki | False | 1.743824 |
| dewiki | True | 2.102764 |
| enwiki | False | 1.707443 |
| enwiki | True | 2.072581 |

A note about the details of the calculation method:
For simplicity, these queries ignore namespaces (i.e. they possibly undercount slightly if pages with the same title in different namespaces occur in the same session, e.g. https://de.wikipedia.org/wiki/Portal:Gesellschaft and https://de.wikipedia.org/wiki/Gesellschaft ). This only affects the pageviews part, as previews are limited to links to mainspace pages anyway. The unique pageviews calculation at T144603#4010992 avoided this issue by using page IDs instead of page names, but those are not available for previews in the instrumentation data. A comparison shows that the difference is negligible for present purposes (the results are the same to two decimals as stated above).

Removed "Rate of accidental card views" from the task because we never got around to define a solid calculation method for this (considering the earlier observations at e.g. T139319#2475811 ).
Also a note that results so far have been given on a per-wiki basis only. Weighing the two wikis properly against each other to arrive at a solid "cumulative" result would be a bit tricky.

So for the report we used the pageviews per session averaged over the entire experiment timespan, but I had also looked at the time series of pageviews per daily session (which obviously is subject to much larger truncation effects, but offers the additional perspective of a repeat experiment, so to speak). This shows that the slight decrease in pageviews is consistent and passes a sign test (see the sketch after the query below):

Pageviews per daily sessions, enwiki, previews ON vs. OFF, Oct-Nov 2017.png (442×545 px, 33 KB)

(I haven't yet gotten around to preparing the analogous graphs for dewiki and the second experiment, but wanted to post this one here now before I forget.)

Data via

SELECT wiki,
year, month, day, 
CONCAT(year,'-',LPAD(month,2,'0'),'-',LPAD(day,2,'0')) AS ymd, 
event.popupEnabled AS popupEnabled,
COUNT(DISTINCT event.pageToken)/COUNT(DISTINCT event.sessionToken) AS page_tokens_per_session,
COUNT(DISTINCT event.sessionToken) AS sessions
FROM tbayer.Popups
WHERE wiki IN ('enwiki', 'dewiki')
AND event.isAnon = 1
AND year = 2017
AND event.action = 'pageLoaded'
GROUP BY wiki, year, month, day, event.popupEnabled
ORDER BY wiki, year, month, day, popupEnabled LIMIT 10000
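
For reference, the sign test mentioned above is easy to reproduce (a sketch in TypeScript, not the exact script used): under the null hypothesis of no effect, the ON group would come out lower on any given day with probability 0.5, so the number of such days follows a binomial distribution.

// Sign test: one-sided P(X >= daysOnLower) for X ~ Binomial(totalDays, 0.5).
function signTestPValue(daysOnLower: number, totalDays: number): number {
  let p = 0;
  for (let k = daysOnLower; k <= totalDays; k++) {
    p += binomial(totalDays, k) * Math.pow(0.5, totalDays);
  }
  return p;
}

// Binomial coefficient C(n, k), computed multiplicatively.
function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// E.g. if the ON group were lower on all 21 days of the first test
// (hypothetical input), p = 0.5^21 ≈ 4.8e-7.
console.log(signTestPValue(21, 21));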

Leaving this here for later, as a followup regarding the timing resolution mystery:
A histogram for the performance.now()-based timestamp field added most recently to the Popups schema, rounded to 1 ms. It seems there is a slight 8 ms periodicity, but nothing like T182314#3954765 above.

Link interaction end time, measured by performance.now() (previews enabled, enwiki, Dec 21, 2017-Feb 14, 2018) 3000-3200ms.png (438×604 px, 33 KB)

Data via

SELECT 
bucket,
COUNT(*) AS linkinteractions
FROM (
  SELECT ROUND(event.timestamp) AS bucket
  FROM event.popups
  WHERE ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=14) )
  AND wiki ='enwiki' -- TODO: redo this for dewiki
  AND event.popupEnabled = 1
  AND event.linkInteractionToken IS NOT NULL) AS linkinteractionslist
GROUP BY bucket
ORDER BY bucket LIMIT 10000

FYI: although the main report is done, we're keeping this open so that Tilman can do some follow-up analysis!

(leaving this here for reference:)
The average number of link hovers per pageview (with any outcome, including no card shown; in test or control; for anonymous desktop pageviews) was around 4.4 in the second experiment (December-February).

As detailed above (T182314#4016412), the average number of seen previews per pageview (in the test group during the second experiment) was about 0.27, i.e. only about one in every 16-17 hovers (4.4 / 0.27 ≈ 16.3) resulted in a deliberately viewed preview. That underlines the importance of the delay before the card starts to be shown; see also the graphs (I still need to post a more polished version of these now that we have a better grip on the comb artefacts discussed above).

SELECT wiki,
COUNT(DISTINCT event.pageToken) AS pageTokens,
COUNT(DISTINCT event.linkInteractionToken) AS links_hovered,
COUNT(DISTINCT event.linkInteractionToken)/COUNT(DISTINCT event.pageToken) AS links_hovered_per_pageview
FROM event.popups
WHERE
wiki IN ('enwiki', 'dewiki')
AND ((month = 12 AND day >= 21) OR (month = 1) OR (month = 2 AND day <=14) )
AND event.isAnon = 1
GROUP BY wiki
ORDER BY wiki LIMIT 10000;

| wiki | pageTokens | links_hovered | links_hovered_per_pageview |
| --- | --- | --- | --- |
| dewiki | 34210998 | 150866348 | 4.409878601027658 |
| enwiki | 110852833 | 490263663 | 4.422653438185021 |
2 rows selected (2618.878 seconds)

[edited to add: The average number of link hovers (and seen previews) per pageview should normally be calculated based on data that is sampled per pageview; but in order to answer the other research questions, this instrumentation used sampling by browser session instead. This can introduce inaccuracies because these numbers may not be statistically independent across pageviews within one session; however, we assume that this error is very small, considering the small average session lengths.]

@phuedx From https://developers.google.com/web/updates/2018/02/meltdown-spectre :

> To exploit Meltdown or Spectre, an attacker needs to measure how long it takes to read a certain value from memory. For this, a reliable and accurate timer is needed.
>
> One API the web platform offers is performance.now() which is accurate to 5 microseconds. As a mitigation, all major browsers have decreased the resolution of performance.now() to make it harder to mount the attacks.

See also https://developer.mozilla.org/en-US/docs/Web/API/Performance/now which has a bit more detail about specific browsers. It appears that 1 ms is Firefox's default resolution, but certain settings in different browsers (including privacy-related ones) may alter that.
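
For what it's worth, the effective performance.now() resolution of a given browser can be probed directly from the console (a quick sketch; it busy-waits until it has seen a few distinct clock increments):

// Probe the smallest observable increment of performance.now().
function probeTimerResolution(increments = 50): number {
  let last = performance.now();
  let minDelta = Infinity;
  let seen = 0;
  while (seen < increments) {
    const now = performance.now();
    if (now > last) {
      minDelta = Math.min(minDelta, now - last);
      last = now;
      seen++;
    }
  }
  return minDelta; // e.g. ~0.005 ms, or ~1 ms in clamped browsers
}

console.log('performance.now() resolution: ~' + probeTimerResolution() + ' ms');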

Thanks @Imarlier! When I looked at this earlier (T182314#3947366 ), the anomalies didn't seem to be related to those Meltdown/Spectre-related changes, but if you have more detailed observations, they would be welcome.

Some browsers target 60 fps, which means around 16 ms (1000/60 ≈ 16.7 ms) between paint frames to execute stuff. I don't know if those 60 fps targets could somehow be synced to round times, which would explain an 8/16 ms cycle; the spikes would then be clients that are up to date with NTP and experiencing smooth execution, and the others clients that have a clock skew, that are currently experiencing jank, or that are coming from browsers that behave differently.

It's all quite a stretch, but that's what came to mind when I saw that a value related to 16 showed up.

It would be interesting to see if this pattern is more prominent on some user agents or some hardware platforms than others.

Per discussion with @ovasileva , I'm closing this task now, and splitting off the investigation of the timing anomalies into T216852. (To recap, while we found out a lot of things about how the underlying timers work and reached a level of confidence that these anomalies should not materially affect the main takeaways for this A/B test, we have not yet found a satisfying explanation for them.)

For reference, the main outcomes from this analysis have been summarized at https://www.mediawiki.org/wiki/Page_Previews/2017-18_A/B_Tests .

The "Nice to have" item in the task description, about the effects on "reading depth"/time spent per page, was not done. We have since conducted a fuller investigation and vetting of this new metric in a separate research project, and applied it to a more recent A/B test, for the page issues feature (for which we also included better cross-wiki sampling support in the instrumentation.

There are a couple of interesting followup questions regarding the Popups data, which might be addressed elsewhere. The data is also of relevance to WMDE's current work on the related reference previews project.