This task coordinates the release of a subset of real user performance data collected during the first round of research for this project: https://meta.wikimedia.org/wiki/Research:Study_of_performance_perception_on_Wikimedia_projects That research led to a short paper entitled "A Large-scale Study of Wikipedia Users' Quality of Experience", due to be presented and published at The Web Conference 2019.
I can share the camera-ready version of the paper privately with anyone at Wikimedia who is interested before its publication, since it may help explain why this specific chunk of data is being requested for publication.
I expect that Analytics, Legal and Security will want to review this dataset. Feel free to create dedicated subtasks for each team.
= Timespan =
2018-05-24 12:55:12 -> 2018-10-15 11:59:52
= Wikis =
Data was collected on cawiki, frwiki, enwikivoyage and ruwiki. We need at the very least data for ruwiki.
= Data fields =
The following have all been collected client-side, via the [[ https://www.mediawiki.org/wiki/Extension:NavigationTiming | NavigationTiming extension ]]:
# **wiki** Which wiki the request was on (ruwiki, cawiki, eswiki, frwiki or enwikivoyage)
# **time** Timestamp. It can be rounded to the minute or the hour if needed, since we don't need second-level accuracy at all. It is still useful in the study to demonstrate the lack of temporal correlation (time of day, day of week, day of month). Since the timestamp doesn't need to be the real one to prove that lack of temporal correlation, the timestamp values should be shifted by an arbitrary value for the entire dataset.
# **unload** [1] The time spent on unload (unloadEventEnd - unloadEventStart).
# **redirecting** [1] Time spent following redirects.
# **fetchStart** [1] The time immediately before the user agent starts checking any relevant application caches.
# **dnsLookup** [1] Time it took to resolve names (domainLookupEnd - domainLookupStart).
# **secureConnectionStart** [1] The time immediately before the user agent starts the handshake process to secure the current connection.
# **connectStart** [1] The time immediately before the user agent starts establishing the connection to the server to retrieve the document.
# **connectEnd** [1] The time immediately after the user agent finishes establishing the connection to the server to retrieve the current document.
# **requestStart** [1] The time immediately before the user agent starts requesting the current document from the server, or from relevant application caches or from local resources.
# **responseStart** [1] The time immediately after the user agent receives the first byte of the response from the server, or from relevant application caches or from local resources.
# **responseEnd** [1] The time immediately after the user agent receives the last byte of the current document or immediately before the transport connection is closed, whichever comes first.
# **loadEventStart** [1] The time immediately before the load event of the current document is fired.
# **loadEventEnd** [1] The time when the load event of the current document is completed.
# **mediawikiLoadEnd** MediaWiki-specific. The time at which all ResourceLoader modules for this page have completed loading and executing.
# **domComplete** [1] The time immediately before the user agent sets the current document readiness to "complete".
# **domInteractive** [1] The time immediately before the user agent sets the current document readiness to "interactive".
# **gaps** [1] The gaps in the Navigation Timing metrics. Calculated by taking the sum of: domainLookupStart - fetchStart, connectStart - domainLookupEnd, requestStart - connectEnd and loadEventStart - domComplete.
# **firstPaint** [2] The time when something is first displayed on the screen.
# **rsi** [3] RUMSpeedIndex. Estimate of the SpeedIndex value based on ResourceTiming data. //Now moved to the [[ https://meta.wikimedia.org/wiki/Schema:RUMSpeedIndex | RUMSpeedIndex EventLogging schema ]], but was collected as part of the NavigationTiming schema at the time of the study.//
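To make the **time** and **gaps** definitions above concrete, here is a minimal Python sketch. It is not the code used in the study. The offset value, the function names and the choice to truncate to the hour (rather than round to the nearest hour or minute) are all assumptions made for illustration; the field names are the standard Navigation Timing attributes.

```python
from datetime import datetime, timedelta

# Hypothetical constant offset, applied identically to every row.
# The real value would be chosen arbitrarily and not published.
TIME_SHIFT = timedelta(days=12, hours=7)


def anonymize_time(ts: datetime, shift: timedelta = TIME_SHIFT) -> datetime:
    """Shift a timestamp by a fixed offset, then truncate it to the hour."""
    shifted = ts + shift
    return shifted.replace(minute=0, second=0, microsecond=0)


def navigation_gaps(t: dict) -> int:
    """Sum the four gaps listed in the **gaps** field definition."""
    return (
        (t["domainLookupStart"] - t["fetchStart"])
        + (t["connectStart"] - t["domainLookupEnd"])
        + (t["requestStart"] - t["connectEnd"])
        + (t["loadEventStart"] - t["domComplete"])
    )


# Made-up sample values, for illustration only.
sample = {
    "fetchStart": 10, "domainLookupStart": 12, "domainLookupEnd": 30,
    "connectStart": 31, "connectEnd": 80, "requestStart": 82,
    "domComplete": 900, "loadEventStart": 903,
}
print(anonymize_time(datetime(2018, 5, 24, 12, 55, 12)))  # 2018-06-05 19:00:00
print(navigation_gaps(sample))  # 8
```

Using one offset for the whole dataset keeps the intervals between events intact, which is what a temporal-correlation analysis relies on.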
And the following metrics, which are derived from NavigationTiming metrics and designed to preserve privacy:
20. **speed_quantized** The page download speed, evaluated as (**transferSize** * 8) / (**loadEventStart** - **fetchStart**) and quantized into these bins: [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 20000]. //(The sensitive metric is transferSize [1], the size of the gzipped HTML of the measured article.)//
21. **speed_over_median_per_country** The page download speed (evaluated as above) normalized over the median per-country speed observed in the dataset.
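The two derived speed metrics above could be computed roughly as follows. This is a sketch under stated assumptions, not the study's actual pipeline: the function names are hypothetical, each value is labelled with the lower edge of its bin (the task does not say how bins are labelled), and the country is assumed to be available per row at computation time even though it is not in the published field list. If transferSize is in bytes and the timings are in milliseconds, as in the W3C specifications, the resulting speed would be in kilobits per second.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import median

# Bin edges as listed in the speed_quantized definition.
BINS = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 20000]


def download_speed(transfer_size, load_event_start, fetch_start):
    """(transferSize * 8) / (loadEventStart - fetchStart).

    The caller must make sure the duration is strictly positive.
    """
    return (transfer_size * 8) / (load_event_start - fetch_start)


def quantize_speed(speed):
    """Return the lower edge of the bin containing `speed` (assumed labelling)."""
    idx = bisect_right(BINS, speed) - 1
    idx = max(0, min(idx, len(BINS) - 1))  # clamp out-of-range values
    return BINS[idx]


def speed_over_country_median(rows):
    """Normalize each speed by the median speed of its country.

    `rows` is an iterable of (country, speed) pairs; the result keeps the
    input order. Medians are assumed to be non-zero.
    """
    rows = list(rows)
    by_country = defaultdict(list)
    for country, speed in rows:
        by_country[country].append(speed)
    medians = {c: median(v) for c, v in by_country.items()}
    return [speed / medians[country] for country, speed in rows]


# Made-up values, for illustration only.
speed = download_speed(50000, 1200, 200)
print(speed, quantize_speed(speed))  # 400.0 400
print(speed_over_country_median([("FR", 100), ("FR", 300), ("RU", 200)]))  # [0.5, 1.5, 1.0]
```

Only the quantized and normalized values would be published; the raw transferSize and the country stay out of the released dataset.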
Finally, the response users gave to the perception survey:
22. **surveyResponseValue** Can be "yes", "no", or "not sure". The question asked was "Did this page load fast enough?".
[1] metrics coming from the browsers' implementation of the NavigationTiming API ([[ https://www.w3.org/TR/navigation-timing/ | level 1 ]] and [[ https://w3c.github.io/navigation-timing/ | level 2 ]]).
[2] firstPaint comes from the [[ https://www.w3.org/TR/paint-timing/ | Paint Timing API ]] or vendor-specific implementations predating the standards.
[3] RUMSpeedIndex is a compound metric combining several NavigationTiming and ResourceTiming ([[ https://www.w3.org/TR/resource-timing-1/ | level 1 ]] and [[ https://www.w3.org/TR/resource-timing-2/ | level 2 ]]) metrics into a single score. It's a 3rd-party FLOSS library found here: https://github.com/WPO-Foundation/RUM-SpeedIndex
EventLogging schemas these fields are coming from:
- [[ https://meta.wikimedia.org/wiki/Schema:NavigationTiming | NavigationTiming ]] (fields 1-21 in the list above)
- [[ https://meta.wikimedia.org/wiki/Schema:QuickSurveysResponses | QuickSurveysResponses ]] (field 22)