
Investigate discrepancies in editor metrics between Data Lake and MediaWiki replica pipelines
Closed, Resolved (Public)


For the full results of the investigation, see wikimedia-research/2019-02-active-editors-discrepancy on GitHub (backup nbviewer link if that fails to display).

Issues identified



  • T218824: A few alterblocks events have event_timestamps from before 2001
  • Importing revisions from other wikis can add arbitrary amounts of history at arbitrary points in the past (identifying imported revisions requested in T221482).
  • Revision deletion of user names results in null user information in mediawiki_history (documented on Wikitech and in T212172)
  • T220456: Many small wikis missing from mediawiki_history dataset, which led to the inconsistent exclusion of 119 small wikis.
  • The MediaWiki replica pipeline counted only wikis in a specific set of sitegroups, whereas the Data Lake pipeline counted all the wikis present in the mediawiki_history data, which led to the inconsistent inclusion of 34 small wikis.
  • The MediaWiki replica pipeline considered a user a bot if they were ever in the bot group (since that information is easily accessible from the user table), whereas the Data Lake pipeline considered them a bot in a given month if they were in the group during the given month (event_user_groups_historical) or at the time the snapshot was generated (event_user_groups). The Data Lake approach appears to give superior results.
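
To make the bot-classification difference in the last item concrete, here is a minimal Python sketch. It is an illustration only, not the code of either pipeline: the `Edit` record, its field names, the function names, and the sample data are all hypothetical. `groups_historical` stands in for event_user_groups_historical and `groups_snapshot` for event_user_groups.

```python
from collections import defaultdict
from typing import NamedTuple


class Edit(NamedTuple):
    """Hypothetical, simplified edit record (not the real mediawiki_history schema)."""
    user: str
    month: str                    # "YYYY-MM"
    groups_historical: frozenset  # groups held when the edit was made
    groups_snapshot: frozenset    # groups held when the snapshot was generated


def replica_style_editors(edits, ever_bot):
    """Per-month non-bot editors, excluding anyone who was EVER in the bot group."""
    result = defaultdict(set)
    for e in edits:
        if e.user not in ever_bot:
            result[e.month].add(e.user)
    return dict(result)


def data_lake_style_editors(edits):
    """Per-month non-bot editors, excluding a user only for edits where they
    were in the bot group at edit time or at snapshot time."""
    result = defaultdict(set)
    for e in edits:
        is_bot = "bot" in e.groups_historical or "bot" in e.groups_snapshot
        if not is_bot:
            result[e.month].add(e.user)
    return dict(result)


# Made-up data: user "A" held the bot flag in 2018-01 and lost it later.
sample_edits = [
    Edit("A", "2018-01", frozenset({"bot"}), frozenset()),
    Edit("A", "2018-06", frozenset(), frozenset()),
    Edit("B", "2018-06", frozenset(), frozenset()),
]

replica_counts = replica_style_editors(sample_edits, ever_bot={"A"})
data_lake_counts = data_lake_style_editors(sample_edits)
```

With this made-up data, the replica-style function drops "A" from every month, while the Data Lake-style function still counts "A" in 2018-06, after the bot flag was removed. That kind of per-month difference is what the last item describes.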

Event Timeline

nshahquinn-wmf created this task.
nshahquinn-wmf moved this task from Triage to Doing on the Product-Analytics board.

I have now identified pretty much all the discrepancies; they're listed in the description, along with the link to the full notebook of my results.

@JAllemandou, I've filed all the issues that could use your attention as separate tasks. They're all linked in the description, so please take a look at any you haven't seen already :)