For the full results of the investigation, see wikimedia-research/2019-02-active-editors-discrepancy on GitHub (backup nbviewer link if that fails to display).
Issues identified
Major
- T221338: Many revision events in mediawiki_history have missing page and namespace information
- T218463: Some registered users have null values for event_user_text and event_user_text_historical in mediawiki_history
Minor
- T218824: A few alterblocks events have event_timestamps from before 2001
- Importing revisions from other wikis can add arbitrary amounts of history at arbitrary points in the past (a way to identify imported revisions was requested in T221482).
- Revision deletion of user names results in null user information in mediawiki_history (documented on Wikitech and in T212172)
- T220456: Many small wikis missing from mediawiki_history dataset, which led to the inconsistent exclusion of 119 small wikis.
- The MediaWiki replica pipeline counted only wikis in a specific set of sitegroups, whereas the Data Lake pipeline counted all the wikis present in the mediawiki_history data, which led to the inconsistent inclusion of 34 small wikis.
- The MediaWiki replica pipeline considered a user a bot if they were ever in the bot group (since that information is easily accessible from the user table), whereas the Data Lake pipeline considered them a bot in a given month if they were in the group during the given month (event_user_groups_historical) or at the time the snapshot was generated (event_user_groups). The Data Lake approach appears to give superior results.
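The difference between the two bot definitions in the last item can be sketched as follows. This is a minimal illustration, not the code of either pipeline: the function names, the `groups_ever` input, and the sample user are hypothetical, and only the field names event_user_groups_historical and event_user_groups come from mediawiki_history.

```python
def is_bot_replica(groups_ever):
    """Replica-style rule: a bot if the user was ever in the 'bot' group.

    `groups_ever` is a hypothetical list of every group the user has held.
    """
    return "bot" in (groups_ever or [])


def is_bot_data_lake(event_user_groups_historical, event_user_groups):
    """Data Lake-style rule: a bot if the user was in the 'bot' group during
    the month of the event or at the time the snapshot was generated.

    Null (None) group fields are treated as empty lists.
    """
    return (
        "bot" in (event_user_groups_historical or [])
        or "bot" in (event_user_groups or [])
    )


# Hypothetical user who held the bot flag in the past but had lost it
# before the month being counted and before the snapshot.
former_bot = {
    "groups_ever": ["bot", "sysop"],
    "event_user_groups_historical": ["sysop"],
    "event_user_groups": ["sysop"],
}

replica_says_bot = is_bot_replica(former_bot["groups_ever"])
data_lake_says_bot = is_bot_data_lake(
    former_bot["event_user_groups_historical"],
    former_bot["event_user_groups"],
)
```

Under these assumptions the replica-style rule excludes the user's edits as bot edits, while the Data Lake-style rule counts them as human edits for that month, which is one way the two active editor counts can diverge.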