
"Source of truth" dataset for pageviews
Open, Low, Public


Request Status: New Request
Request Type: project support request
Related OKRs: Metrics that matter, Knowledge Equity - Regional views

Request Title: "Source of truth" dataset for pageviews

  • Request Description: The pageview data available in pageview_daily and pageview_hourly has known issues that make those datasets difficult to use and create a high risk of drawing faulty conclusions:
    • Highest priority: Between June 2021 and January 2022, a data loss affected pageviews, particularly in certain regions such as the Americas (see T300164). When we explore the data in Wikistats or Superset, it is easy to forget about the data loss and draw conclusions about trends that are incorrect or even directionally wrong. Some people in the Foundation and our communities do not know about this data loss in the first place. We also do not (yet) have a standardized calculation for correcting the data in that period, and continuing to do calculations on an ad hoc basis introduces more room for error.
    • In late 2019 and early 2020, a large amount of (suspected) bot activity skewed pageviews, particularly in the United States (see T239811). This happened before Data Engineering implemented a flag for automated traffic. We frequently use 2019 as a pre-pandemic baseline in analyses, but we need to correct for the bot activity when looking at US traffic and at traffic with a "none" referrer.
    • There are several other similar issues that we'll also want to consider, to be added later.
  • Indicate Priority Level: high
  • Main Requestors: @kzimmerman
  • Ideal Delivery Date: FY22-23
  • Stakeholders:

Request Documentation

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | T300164 (documentation of 2021 data loss), T239811 (investigation revealing suspected bot traffic in 2019)
Product One Pager | Yes | <add link here>
Product Requirements Document (PRD) | Yes | <add link here>
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>

Event Timeline

mpopov subscribed.

Note to future self: not moving to Tracking because this will be a collaboration

mpopov triaged this task as High priority. Jun 28 2022, 5:12 PM

Discussed during our 1:1 in July. We have a couple of approaches in mind:

  • Provide a warning on DataHub or Wikitech.
  • Hive: when you run a query on the pageview data tables (webrequest, pageview_hourly, pageview_daily, etc.), a header will say “this data is incomplete..”. This should be available in all tools that use Hive.
  • Any local copy of the data will not have that flag.
  • Remove data from that period.
  • Provide two columns: observed data and projected data.
  • Remove incorrect pageview data from that period from all public datasets:
    • API: will have a header, similar to Hive, that says “this data is incomplete..”
    • Dumps (community datasets): a separate section for the period with data loss, published as a separate flat file in the dump with a warning acknowledging the loss.
    • Accessing the internal data for the observed values will require a separate action.

Priority of this task lowered from High to Low. Feb 28 2023, 5:46 PM
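
As a rough illustration of the "observed and projected columns" and "warning" ideas in the list above, here is a minimal Python sketch. Everything in it is hypothetical: the names `KNOWN_ISSUES`, `PageviewRow`, and `annotate` do not exist in any Wikimedia tool, the month-level window boundaries are an assumption based on "June 2021 to January 2022", and the projected value is supplied by the caller because no standardized correction exists yet.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical registry of known data-issue windows.
# Boundaries are assumed at month granularity; see T300164 for the real details.
KNOWN_ISSUES = [
    (date(2021, 6, 1), date(2022, 1, 31), "T300164: this data is incomplete"),
]


@dataclass
class PageviewRow:
    day: date
    observed: int             # value as recorded in the dataset
    projected: Optional[int]  # corrected estimate; None if no estimate was supplied
    warning: Optional[str]    # set only when the day falls in a known-issue window


def annotate(day: date, observed: int, estimate: Optional[int] = None) -> PageviewRow:
    """Attach a projected value and a warning to one daily pageview count."""
    for start, end, note in KNOWN_ISSUES:
        if start <= day <= end:
            # Affected day: keep the observed value, expose the caller's estimate
            # (which may be None), and carry the warning alongside the data.
            return PageviewRow(day, observed, estimate, note)
    # Unaffected day: the projected value is simply the observed value.
    return PageviewRow(day, observed, observed, None)
```

Keeping the warning on each row, rather than only in a query header, means that a local copy of the data would still carry the flag.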

Changing this to Medium, as we have workarounds in place and have been using the corrected pageviews in several reports. We have also provided recommendations and documented the issues on Wikitech.
The pageviews metric needs a fair amount of work overall; we will prioritize this task after the work on Unique Devices and on improving bot detection.

Removing this from the Pageview Data Loss epic, since this task also needs to account for other issues.