
A Large-scale Study of Wikipedia Users' Quality of Experience: data release
Closed, ResolvedPublic

Description

This is a task to coordinate the release of a subset of real-user performance data that was collected during the first round of research for this project: https://meta.wikimedia.org/wiki/Research:Study_of_performance_perception_on_Wikimedia_projects This research led to a short paper entitled "A Large-scale Study of Wikipedia Users' Quality of Experience", due to be presented and published at The Web Conference 2019.

I can share the camera-ready version of the paper privately with anyone at Wikimedia who might be interested before its publication, as it might help explain why this specific subset of the data is being requested for release.

I expect that Analytics, Legal and Security will want to review this dataset. Feel free to create dedicated subtasks for each team.

Timespan

2018-05-24 12:55:12 -> 2018-10-15 11:59:52

Wikis

Data was collected on cawiki, frwiki, enwikivoyage and ruwiki. At the very least, we need the data for ruwiki.

Data fields

The following have all been collected client-side, via the NavigationTiming extension:

  1. wiki Which wiki the request was on (ruwiki, cawiki, eswiki, frwiki or enwikivoyage)
  2. time Timestamp, which can be rounded to the minute or the hour if needed; we don't need second-level accuracy at all. It is useful in the study to demonstrate the lack of temporal correlation (time of day, day of week, day of month). Since we don't need the timestamp to be the real one to prove the lack of temporal correlation, the timestamp values should be shifted by an arbitrary value for the entire dataset.
  3. unload [1] The time spent on unload (unloadEventEnd - unloadEventStart).
  4. redirecting [1] Time spent following redirects.
  5. fetchStart [1] The time immediately before the user agent starts checking any relevant application caches.
  6. dnsLookup [1] Time it took to resolve names (domainLookupEnd - domainLookupStart).
  7. secureConnectionStart [1] The time immediately before the user agent starts the handshake process to secure the current connection.
  8. connectStart [1] The time immediately before the user agent starts establishing the connection to the server to retrieve the document.
  9. connectEnd [1] The time immediately after the user agent finishes establishing the connection to the server to retrieve the current document.
  10. requestStart [1] The time immediately before the user agent starts requesting the current document from the server, or from relevant application caches or from local resources.
  11. responseStart [1] The time immediately after the user agent receives the first byte of the response from the server, or from relevant application caches or from local resources.
  12. responseEnd [1] The time immediately after the user agent receives the last byte of the current document or immediately before the transport connection is closed, whichever comes first.
  13. loadEventStart [1] The time immediately before the load event of the current document is fired.
  14. loadEventEnd [1] The time when the load event of the current document is completed.
  15. mediawikiLoadEnd MediaWiki-specific. The time at which all ResourceLoader modules for this page have completed loading and executing.
  16. domComplete [1] The time immediately before the user agent sets the current document readiness to "complete".
  17. domInteractive [1] The time immediately before the user agent sets the current document readiness to "interactive".
  18. gaps [1] The gaps in the Navigation Timing metrics. Calculated by taking the sum of: domainLookupStart - fetchStart, connectStart - domainLookupEnd, requestStart - connectEnd and loadEventStart - domComplete (a sketch of this computation follows this list).
  19. firstPaint [2] The time when something is first displayed on the screen.
  20. rsi [3] RUMSpeedIndex. Estimate of the SpeedIndex value based on ResourceTiming data. Now moved to the RUMSpeedIndex EventLogging schema, but was collected as part of the NavigationTiming schema at the time of the study.
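
For concreteness, here is a minimal sketch of how the derived gaps field (#18) can be recomputed from the raw marks. The dict-based representation is purely illustrative; the actual computation happens client-side in the NavigationTiming extension.

```
# Minimal sketch: recomputing the derived "gaps" metric (field 18)
# from raw Navigation Timing marks (values in milliseconds).
def compute_gaps(t):
    return (
        (t["domainLookupStart"] - t["fetchStart"])
        + (t["connectStart"] - t["domainLookupEnd"])
        + (t["requestStart"] - t["connectEnd"])
        + (t["loadEventStart"] - t["domComplete"])
    )

# Example with made-up values:
marks = {
    "fetchStart": 5, "domainLookupStart": 20, "domainLookupEnd": 45,
    "connectStart": 45, "connectEnd": 90, "requestStart": 95,
    "domComplete": 800, "loadEventStart": 830,
}
print(compute_gaps(marks))  # 15 + 0 + 5 + 30 = 50
```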

And the following metrics, which are derived from NavigationTiming metrics and were designed to preserve privacy:

  1. speed_quantized The page download speed, evaluated as (transferSize * 8) / (loadEventStart - fetchStart) and quantized into the bins [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 20000] (the sensitive metric is transferSize [1], the size of the gzipped HTML of the measured article)
  2. speed_over_median_per_country The page download speed (evaluated as above) normalized over the median per-country speed observed in the dataset. Both derivations are sketched below.
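
For illustration, a minimal pandas sketch of both derivations, assuming a DataFrame df with hypothetical columns transferSize, loadEventStart, fetchStart and country (the column names are illustrative, not the real schema):

```
import pandas as pd

bins = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 20000]

# Raw download speed; transferSize is the sensitive input and is never released.
speed = (df["transferSize"] * 8) / (df["loadEventStart"] - df["fetchStart"])

# speed_quantized: bucket the raw speed into coarse bins so transferSize
# cannot be recovered from the released field.
df["speed_quantized"] = pd.cut(speed, bins=bins, labels=bins[:-1])

# speed_over_median_per_country: normalize against the median speed
# observed for each country in the dataset.
df["speed_over_median_per_country"] = (
    speed / speed.groupby(df["country"]).transform("median")
)
```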

Finally, the response users gave to the perception survey:

  1. surveyResponseValue Can be "yes", "no", or "not sure". The question asked was "Did this page load fast enough?".

[1] metrics coming from the browsers' implementation of the NavigationTiming API (level 1 and level 2).
[2] firstPaint comes from the Paint Timing API or vendor-specific implementations predating the standards.
[3] RUMSpeedIndex is a compound metric combining several NavigationTiming and ResourceTiming (level 1 and level 2) metrics into a single score. It's a 3rd-party FLOSS library found here: https://github.com/WPO-Foundation/RUM-SpeedIndex


Event Timeline

Gilles created this task.
Gilles triaged this task as Medium priority. Feb 28 2019, 3:24 PM
Milimetric raised the priority of this task from Medium to High. Feb 28 2019, 5:51 PM
Milimetric lowered the priority of this task from High to Medium.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric subscribed.

@JBennett @JFishback_WMF could we get an update on when this might get looked at?

@JBennett and @JFishback_WMF can you please assign this task to someone on your end so we can make sure it has an owner and will be processed? Also, if you can provide a sense of timelines for getting back to us, that'd be great.

@leila and @Gilles I'll work on this. I'll get started on it as soon as I can, but is there a particular timeline we're tracking to?

@JFishback_WMF Gilles can speak to the timelines better. From my perspective, the sooner the better so it doesn't fall in the backlog of things to do. :)

Well... the researchers ended up revising their journal submission to reflect the fact that they couldn't release the dataset at this time. But I think they're still very eager to do so. It would be nice to be able to do this before the end of the calendar year.

Let's see: this dataset has neither page info nor timestamps, is that correct?

My bad. We probably want timestamp as well, but it can be very coarse (rounded to the hour is fine), as well as which wiki we're dealing with (since the study ran on multiple wikis). I'll add that to the task description.

@Gilles it would be good to shift the timestamps so this data cannot be linked (or rather, not obviously linked) with any existing data (say, pageviews per wiki per hour, which are released hourly). This is a technique we have used in prior data releases.
Also, it's worth considering that if the wiki field does not add much to the dataset (that is, if perception of performance is not dependent on the wiki, per your study), it might be better to remove it to make the dataset more opaque.

Sure, we can shift the timestamps by an arbitrary amount. It would still prove lack of temporal correlation.
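
For illustration, a minimal sketch of such a constant-offset shift, assuming the data sits in a pandas DataFrame with a time column (the column name is hypothetical):

```
import secrets
import pandas as pd

# One secret offset applied to every row: relative ordering and spacing are
# preserved (so temporal-correlation analyses still hold), but the values can
# no longer be joined against public per-hour pageview data.
offset = pd.Timedelta(seconds=secrets.randbelow(365 * 24 * 3600))
df["time"] = pd.to_datetime(df["time"]) + offset

# Coarsen to the hour as well, per the discussion above.
df["time"] = df["time"].dt.floor("h")
```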

The satisfaction ratios per wiki are a bit different: https://grafana.wikimedia.org/d/000000551/performance-perception-survey?orgId=1 We had the same findings when looking at each wiki separately, but we can't really mix data between different wikis. Another possibility is to only keep the ruwiki data, which has by far the largest traffic during the study period.

> Another possibility is to only keep the ruwiki data, which has by far the largest traffic during the study period.

Sounds good. If you update the ticket description with the fields, maybe we can take a look together with @JFishback_WMF later this month?

FYI, released one-off datasets get documented on Meta; see, for example: https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream

After discussing this, we've come to the conclusion that we can do this effectively with 2 separate datasets.

One with just the wiki, the time, rounded to the hour, and the survey response (1, 2 and 22).

Another with wiki, performance metrics and the survey response, completely shuffled to lose the original order (1, 3-22).

Finally, we would release data only for ruwiki and frwiki.
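
For illustration, a minimal pandas sketch of producing the two files under those constraints. Column names are illustrative, and df is assumed to be already filtered to ruwiki and frwiki and time-shifted as discussed above:

```
import pandas as pd

# Dataset 1: wiki, time rounded to the hour, survey response (fields 1, 2, 22).
df1 = df[["wiki", "time", "surveyResponseValue"]].copy()
df1["time"] = pd.to_datetime(df1["time"]).dt.floor("h")

# Dataset 2: everything except the timestamp (fields 1, 3-22), with rows
# shuffled and the index dropped so the original collection order is lost.
df2 = (
    df.drop(columns=["time"])
      .sample(frac=1)
      .reset_index(drop=True)
)

df1.to_csv("df_releasable_1.csv", index=False)
df2.to_csv("df_releasable_2.csv", index=False)
```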

@Fsalutari has put together a sanitised version of the dataset(s) according to the instructions we agreed on. It's available as df_releasable_1.csv and df_releasable_2.csv under /home/fsalutari on stat1004. I probably won't get a chance to look at it until the week of November 10, but I figured I'd share it here in case you want to look at it before I get a chance to, @Nuria

I've reviewed the data, it's exactly what we had requested and looks completely safe to release. @Nuria can you review these 2 files (takes 5 minutes, really) by Nov 4?

Looked at this, +1 on my end. @JFishback_WMF needs to (per our privacy framework) write up a risk assessment that goes on Wikitech together with the pointers to the data release, and then I think we are ready.

Due to the low impact of harm, and low opportunity and probability of malicious use of this data, coupled with the mitigations of aggregation and minimization, the residual risk of releasing this data is considered LOW.

WMF-Legal can we please get someone to sign off on this?

Assigning back to @Gilles but let me know if there's anything else you need from me on this.

Folks, I removed Research. Happy to help if our help is needed at some point.

Fantastic, thank you very much!