The long-term goal of both Product-Analytics and Analytics is to have all the data necessary for Product Analytics' work available in the Data Lake.
Analytics has already made significant progress towards this goal (e.g. T161147, T186559), but naturally much remains to be done. This task tracks that work. Specific issues are tracked by subtasks (more to come—so far I've just organized the existing ones), but there are also some more general considerations.
As of December 2018, I'd estimate that Product-Analytics uses the Data Lake for about 60% of our analyses. Once revision tags are available (T161149), I expect that will rise to about 80%.
However, getting all the way to 100% will be harder. Here are some important reasons for this:
Every table, some day
There is a long tail of analysis which gets done rarely and unpredictably and can use any of the MediaWiki tables (except for the ones that are unused and ready to be dropped: T54921). For example, this could be looking at:
- notification patterns (T113664, using the echo_event and echo_notification tables)
- skin preferences (T147696, using the user_properties table)
- Wikipedia-to-Wikidata linkage patterns (T209891#4798717, using the page_props table)
- when users add email addresses to their account (T212172#4850805, using the user_email_token_expires field of the user table)
- new user profile information (T212172#4839511, using the user_properties table)
- usage of the Wikimedia Maps service (T212172#4866167, using the page_props table)
- which templates are most frequently used (T96323, using the templatelinks and page_props tables)
- what proportion of pages are tagged with issues (T201123#4494446, using the templatelinks table)
- the number of images with specific Creative Commons licenses on Commons (T150076, using the category and categorylinks tables)
- the number of files of various types present on Commons (T150076, using the image table)
- how often different abuse filters are triggered (T212172#4871548, using the abuse_filter and abuse_filter_log tables)
- when a hidden abuse filter was in effect (T212172#4871548, using the abuse_filter and abuse_filter_history tables)
Currently, we use the MediaWiki replicas for this information. To feel confident that we'd never need them, we'd need all the MediaWiki tables available in the Data Lake.
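To make this concrete, here's a minimal sketch (in Python, using pymysql) of the kind of long-tail query we currently run against a replica. The hostname is a placeholder, and the 'skin' property name is the one MediaWiki uses for the skin preference in user_properties.

```python
import os
import pymysql

# Placeholder connection details; in practice the host and credentials come
# from the analytics MariaDB replica configuration, not hard-coded values.
conn = pymysql.connect(
    host="analytics-replica.example",  # hypothetical hostname
    db="enwiki",
    read_default_file=os.path.expanduser("~/.my.cnf"),
)

SKIN_PREFERENCE_SQL = """
SELECT up_value AS skin, COUNT(*) AS users
FROM user_properties
WHERE up_property = 'skin'
GROUP BY up_value
ORDER BY users DESC
"""

with conn.cursor() as cur:
    cur.execute(SKIN_PREFERENCE_SQL)
    for skin, users in cur.fetchall():
        # up_value is a binary column, so it may come back as bytes
        name = skin.decode() if isinstance(skin, bytes) else skin
        print(f"{name or '(default)'}\t{users}")
```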
Real-time data is real important
Currently, the mediawiki_history data and related datasets in the Data Lake are loaded in monolithic monthly snapshots, which do not arrive until about 10 days after the end of the month. This means there can be up to a 40-day lag between when data is generated and when it's available in the Data Lake.
For many analyses this is fine, but for others (such as an A/B test, or a crisis like a spam attack or community protest during the first weeks of a month) it's a problem.
In many cases, it is possible to use EventBus data, which does arrive in real time. However, even when the data is passing through EventBus, this approach has its own limitations. For example, EventBus data misses out on much of the valuable extra data that the monthly datasets provide (e.g. event_user_seconds_since_previous_revision, event_user_is_created_by_system, geolocation from geoeditors_daily), and the EventBus data may not extend far back in time (e.g. the revision_tag_create table only goes back to September 2018: T201062).
It's theoretically possible to combine both EventBus and mediawiki_history data in a single analysis, but having to draw on two different sources with quite disparate schemas for the same data adds a lot of complexity.
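As a rough illustration of that complexity, here's what stitching the two sources together can look like. The table and field names below are from memory and should be treated as assumptions, and the snapshot and date boundaries are placeholders; the point is the schema mismatch that has to be papered over by hand.

```python
# Query text only; it would be run with whatever Hive/Spark client is at hand.
EDITS_ACROSS_SNAPSHOT_BOUNDARY_SQL = """
SELECT wiki_db,
       event_timestamp AS ts,
       event_user_text AS user_text
FROM wmf.mediawiki_history
WHERE snapshot = '2018-11'          -- last available monthly snapshot
  AND event_entity = 'revision'
  AND event_type = 'create'

UNION ALL

SELECT `database` AS wiki_db,
       rev_timestamp AS ts,
       performer.user_text AS user_text
FROM event.mediawiki_revision_create
WHERE year = 2018 AND month = 12    -- the month the snapshot doesn't cover yet
"""
```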
In addition, in many cases analysis has to be done on tables which only provide information about the present state. For example, to provide daily information on how many pages contain maps, the raw page_props table must be queried every day and the results stored. Even if the page_props table were loaded into the Data Lake once a month, the Data Lake still wouldn't be sufficient for this analysis.
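A minimal sketch of that daily present-state query, assuming Kartographer records its usage under the page_props property names shown (those names are assumptions), and appending each day's count to a local file for illustration:

```python
import datetime
import os
import pymysql

conn = pymysql.connect(
    db="enwiki",
    read_default_file=os.path.expanduser("~/.my.cnf"),  # replica credentials
)

# Property names assumed to be the ones Kartographer sets in page_props.
PAGES_WITH_MAPS_SQL = """
SELECT COUNT(DISTINCT pp_page) AS pages_with_maps
FROM page_props
WHERE pp_propname IN ('kartographer_links', 'kartographer_frames')
"""

with conn.cursor() as cur:
    cur.execute(PAGES_WITH_MAPS_SQL)
    (pages_with_maps,) = cur.fetchone()

# The table only reflects the current state, so each day's count has to be
# captured and appended to our own time series.
with open("pages_with_maps.tsv", "a") as f:
    f.write(f"{datetime.date.today().isoformat()}\t{pages_with_maps}\n")
```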
Private data
We have various uses for private data, which generally isn't included in the edit data in the Data Lake because that data is pulled from the public wiki replicas in the Cloud Services cluster:
- Pulling user email addresses to contact them for a survey or other research project
- Getting data about revision-deleted revisions
Quick lookups
It's often important to look up individual rows of user data for exploration or to diagnose data inconsistencies (e.g. T221338, which required dozens of these lookups to diagnose). With the MediaWiki replicas, these generally take less than a second; with Hive, a single lookup can take several minutes. Presto might help with this, but failing that, we will need the MediaWiki replicas to facilitate fast lookups.
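For example, a typical single-row lookup against a replica looks something like this (the revision ID is a placeholder):

```python
import os
import pymysql

conn = pymysql.connect(
    db="enwiki",
    read_default_file=os.path.expanduser("~/.my.cnf"),
)

with conn.cursor() as cur:
    # Indexed single-row lookup; returns in well under a second on a replica.
    cur.execute(
        "SELECT rev_id, rev_timestamp, rev_page FROM revision WHERE rev_id = %s",
        (123456789,),  # placeholder revision ID
    )
    print(cur.fetchone())
```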