
Provide feature parity between the wiki replicas and the Analytics Data Lake
Closed, Declined · Public

Description

The long-term goal of both Product-Analytics and Analytics is to have all the data necessary for Product Analytics' work available in the Data Lake.

Analytics has already made significant progress towards this goal (e.g. T161147, T186559), but naturally much remains to be done. This task tracks that work. Specific issues are tracked by subtasks (more to come—so far I've just organized the existing ones), but there are also some more general considerations.

As of December 2018, I'd estimate that Product-Analytics uses the Data Lake for about 60% of our analyses. Once revision tags are available (T161149), I expect that will rise to about 80%.

However, getting all the way to 100% will be harder. Here are some important reasons for this:

Every table, some day

There is a long tail of analysis that gets done rarely and unpredictably and can use any of the MediaWiki tables (except for the ones that are unused and ready to be dropped: T54921). For example, this could be looking at:

  • notification patterns (T113664, using the echo_event and echo_notification tables)
  • skin preferences (T147696, using the user_properties table)
  • Wikipedia-to-Wikidata linkage patterns (T209891#4798717, using the page_props table)
  • when users add email addresses to their account (T212172#4850805, using the user_email_token_expires field of the user table)
  • new user profile information (T212172#4839511, using the user_properties table)
  • usage of the Wikimedia Maps service (T212172#4866167, using the page_props table)
  • which templates are most frequently used (T96323, using the templatelinks and page_props tables)
  • what proportion of pages are tagged with issues (T201123#4494446, using the templatelinks table)
  • the number of images with specific Creative Commons licenses on Commons (T150076, using the category and categorylinks tables)
  • the number of various types of files present on Commons (T150076, using the image table)
  • how often different abuse filters are triggered (T212172#4871548, using the abuse_filter and abuse_filter_log tables)
  • when a hidden abuse filter was in effect (T212172#4871548, using the abuse_filter and abuse_filter_history tables)

Currently, we use the MediaWiki replicas for this information. To feel confident that we'd never need them, we'd need all the MediaWiki tables available in the Data Lake.
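For illustration, an ad-hoc replica query of this kind (here, the skin-preference example) might look roughly like the following sketch; the hostname and credentials are placeholders, not the actual analytics replica setup.

```python
# Sketch of an ad-hoc query against a MariaDB wiki replica (skin preferences,
# as in T147696). Hostname and credentials are placeholders.
import pymysql

conn = pymysql.connect(
    host="analytics-replica.example",  # placeholder; not the real host
    port=3306,
    user="research",                   # placeholder credentials
    password="********",
    database="enwiki",
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT up_value AS skin, COUNT(*) AS users
        FROM user_properties
        WHERE up_property = 'skin'
        GROUP BY up_value
        ORDER BY users DESC
    """)
    for skin, users in cur.fetchall():
        print(skin, users)
```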

Real-time data is real important

Currently, the mediawiki_history data and related datasets in the Data Lake are loaded in monolithic monthly snapshots, which do not arrive until about 10 days after the end of the month. This means there can be a lag of up to 40 days between when data is generated and when it's available in the Data Lake.

For many analyses, this is fine, but for others, such as an A/B test or a crisis (say, a spam attack or community protest) during the first weeks of a month, it's a problem.

In many cases, it is possible to use EventBus data, which does arrive in real time. However, even when the data is passing through EventBus, this approach has its own limitations. For example, EventBus data misses out on much of the valuable extra data that the monthly datasets provide (e.g. event_user_seconds_since_previous_revision, event_user_is_created_by_system, geolocation from geoeditors_daily), and the EventBus data may not extend far back in time (e.g. the revision_tag_create table only goes back to September 2018: T201062).

It's theoretically possible to combine both EventBus and mediawiki_history data in a single analysis, but having to draw on two different sources with quite disparate schemas for the same data adds a lot of complexity.
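To illustrate that complexity, here is a rough sketch (for a pyspark shell) of stitching recent EventBus revision-create events onto the latest mediawiki_history snapshot for one wiki. Table and field names follow the public wikitech documentation, but the exact schemas and partition fields should be treated as assumptions.

```python
# Rough sketch (pyspark shell): stitch recent edits from the EventBus
# revision-create stream onto the latest mediawiki_history snapshot for one
# wiki. Table and field names follow the public wikitech docs, but the exact
# schemas and partition fields are assumptions, and the two sources'
# timestamp formats would need harmonizing in a real analysis.
recent = spark.sql("""
    SELECT `database`                     AS wiki_db,
           rev_id                         AS revision_id,
           CAST(rev_timestamp AS STRING)  AS event_timestamp,
           performer.user_text            AS event_user_text
    FROM event.mediawiki_revision_create
    WHERE `database` = 'cswiki'
      AND year = 2019 AND month = 1       -- days not yet in the snapshot
""")

historical = spark.sql("""
    SELECT wiki_db,
           revision_id,
           CAST(event_timestamp AS STRING) AS event_timestamp,
           event_user_text
    FROM wmf.mediawiki_history
    WHERE snapshot = '2018-12'
      AND wiki_db = 'cswiki'
      AND event_entity = 'revision'
      AND event_type = 'create'
""")

all_edits = historical.unionByName(recent)
all_edits.groupBy("event_user_text").count().orderBy("count", ascending=False).show()
```

Even in this trimmed-down form, every added field means reconciling two different schemas, which is exactly the complexity described above.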

In addition, in many cases analysis has to be done on tables which only provide information about the present state. For example, to provide daily information on how many pages contain maps, the raw page_props table must be queried every day and the results stored. Even if the page_props table were loaded into the Data Lake once a month, the Data Lake still wouldn't be sufficient for this analysis.
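As a minimal sketch of that pattern (query the current state daily and store the result), assuming a cron job against a replica; the hostname, credentials, property name, and output file are all placeholders.

```python
# Minimal sketch of a daily "capture current state" job: count pages that
# currently carry a given page prop and append the count with today's date.
# Hostname, credentials, property name, and output path are placeholders.
import datetime
import pymysql

conn = pymysql.connect(
    host="analytics-replica.example", port=3306,          # placeholder host
    user="research", password="********", database="enwiki",
)

with conn.cursor() as cur:
    cur.execute("""
        SELECT COUNT(DISTINCT pp_page)
        FROM page_props
        WHERE pp_propname = 'some_prop'   -- placeholder property name
    """)
    (n_pages,) = cur.fetchone()

# The table itself keeps no history, so the time series has to be built up
# by appending one row per day.
with open("daily_counts.tsv", "a") as f:
    f.write(f"{datetime.date.today()}\t{n_pages}\n")
```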

Private data

We have various uses for private data, which generally isn't included in the edit data in the Data Lake because it's pulled from the public wiki replicas in the Cloud Services cluster:

  • Pulling user email addresses to contact them for a survey or other research project
  • Getting data about revision-deleted revisions

Quick lookups

It's often important to look up individual rows of user data for exploration or to diagnose data inconsistencies (e.g. T221338, which required dozens of these lookups to diagnose). With the MediaWiki replicas, these generally take less than a second; with Hive, a single lookup can take several minutes. Presto might help with this, but failing that, we will need the MediaWiki replicas to facilitate fast lookups.
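For concreteness, the kind of point lookup meant here is something like the pair below: the replica version is an indexed primary-key lookup and returns almost instantly, while the logically equivalent Hive query scans a large partition of mediawiki_history. Wiki, snapshot, and revision ID are placeholders.

```python
# The same logical lookup phrased both ways: fetch one revision by ID. On the
# MariaDB replicas this is an indexed primary-key lookup; in Hive it scans a
# large partition of mediawiki_history. Wiki, snapshot, and rev ID are
# placeholders.
replica_query = """
SELECT rev_id, rev_timestamp, rev_page
FROM revision
WHERE rev_id = 123456789
"""

hive_query = """
SELECT revision_id, event_timestamp, page_id, event_user_text
FROM wmf.mediawiki_history
WHERE snapshot = '2019-03'
  AND wiki_db = 'enwiki'
  AND event_entity = 'revision'
  AND revision_id = 123456789
"""
```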


Event Timeline

Restricted Application changed the subtype of this task from "Deadline" to "Task". Dec 17 2018, 11:17 PM
nshahquinn-wmf renamed this task from "Product Analytics Data Lake needs" to "Support all Product Analytics data needs in the Data Lake". Dec 19 2018, 3:35 PM

During our check-in with @Nuria today, I briefly mentioned the current use case I have for getting data from MariaDB. Let me describe that use case and how it's connected to what the Growth Team is doing.

One of the Growth Team's current projects is Personalized First Day (T206365). We've deployed a short survey to the Czech and Korean Wikipedias, and are serving it up to a random 50% of users who register an account there (meaning autocreated accounts are excluded). Group assignments and responses are stored in the MediaWiki database using the user_properties table (with up_property = "welcomesurvey-responses"), as JSON objects in the up_value column.

The survey currently serves two purposes: 1) to gather responses from newly registered users on Czech and Korean WP, and 2) to test whether this survey leads to a significant reduction in editor activation rate. Before the survey was deployed, we wrote up an experiment plan where we describe this, and also outline a set of leading indicators and plans of action to allow us to react in case something appears to not work right. We did this because we need a month of data to be able to determine if editor activation rate is affected, but we also did not want to wait a month if something was wrong. Once a week, I'm grabbing this data from MariaDB to check whether our leading indicators are triggered and to update the team on how things are going (and currently things are going very well).

One way this data need could be served by the Data Lake would be to sqoop the relevant data once a week, so I could query it there instead.
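For what it's worth, a one-off pull of those rows into Spark over JDBC (rather than a full sqoop pipeline) might look roughly like the sketch below; the JDBC URL, credentials, and the JSON field pulled out of up_value are all placeholders.

```python
# Rough sketch, from a pyspark shell: read the welcomesurvey-responses rows
# from a MariaDB replica over JDBC and parse the JSON stored in up_value.
# Assumes a MySQL JDBC driver on the classpath; the URL, credentials, and
# the 'reason' field name are placeholders.
from pyspark.sql import functions as F

props = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://analytics-replica.example:3306/cswiki")
    .option("dbtable",
            "(SELECT up_user, up_value FROM user_properties"
            " WHERE up_property = 'welcomesurvey-responses') AS t")
    .option("user", "research")        # placeholder credentials
    .option("password", "********")
    .load()
)

# up_value is stored as a blob; cast to string before JSON extraction.
responses = props.withColumn(
    "reason",
    F.get_json_object(F.col("up_value").cast("string"), "$.reason"),
)
responses.groupBy("reason").count().show()
```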

@nettrom_WMF thanks for the notes. I have to say that this is the first time I've heard of a survey being persisted to MW; the other major ones we have run (the ones for "Why do we read Wikipedia?" and the current ongoing ones for the performance study) all get stored in EventLogging. You can see them in Hive, in the events database.

First, I agree with @Nuria that ideally we would find one good solution for surveys and stick with it as a rule. EventLogging seems to work well for other surveys, would it work for you? If so, we could do a one-time import of all the relevant rows from user_properties into the schema.

Second, the user_properties table is one of the most fragile and ephemeral collections of data we have. Other things like categorylinks or pagelinks can in theory be restored historically by going over the raw wikitext. But rows from user_properties are just overwritten when the up_value changes, and there's no way to ever get the old value. For this reason, I would love to hear more about what data analysis needs are against that table, and think together about how to migrate them away or solve them.

> First, I agree with @Nuria that ideally we would find one good solution for surveys and stick with it as a rule. EventLogging seems to work well for other surveys, would it work for you? If so, we could do a one-time import of all the relevant rows from user_properties into the schema.

The ultimate purpose of collecting this data is to personalize new users' experiences based on their background and interests (hence the name "personalized first day" 😁), so the data will need to be accessible to MediaWiki. That's why they're using user_properties rather than EventLogging.

In T212172#4842701, @Neil_P._Quinn_WMF wrote:

> First, I agree with @Nuria that ideally we would find one good solution for surveys and stick with it as a rule. EventLogging seems to work well for other surveys, would it work for you? If so, we could do a one-time import of all the relevant rows from user_properties into the schema.

> The ultimate purpose of collecting this data is to personalize new users' experiences based on their background and interests (hence the name "personalized first day" 😁), so the data will need to be accessible to MediaWiki. That's why they're using user_properties rather than EventLogging.

I do not remember the specific discussions for why it was done this way, but the team is currently looking at how to use the survey results to tailor the new user experience, making that a likely explanation for why it's stored there. @SBisson is the main engineer who worked on it, I'm sure he can pop in and confirm that and/or let us know if there are other reasons.

> The ultimate purpose of collecting this data is to personalize new users' experiences based on their background and interests (hence the name "personalized first day" 😁), so the data will need to be accessible to MediaWiki. That's why they're using user_properties rather than EventLogging.

> I do not remember the specific discussions for why it was done this way, but the team is currently looking at how to use the survey results to tailor the new user experience, making that a likely explanation for why it's stored there. @SBisson is the main engineer who worked on it, I'm sure he can pop in and confirm that and/or let us know if there are other reasons.

Thank you both for pointing this out. That makes sense, but I am interested in what @SBisson thinks. There's always the possibility of running the survey with EventLogging and then updating user_properties separately.

Here's another use case that came up during the analysis of the survey results. I was asked if I could figure out what proportion of users who didn't supply an email address at registration added one in the survey.

From what I know, information about when the user added their email address is not stored directly. There's the PrefUpdate schema, but the user's email address isn't a standard user preference so the code for adding/updating it is separate and doesn't result in a data point captured by that schema. And if it did, I'm not sure if logging of that schema would be triggered at registration.

To solve this, I compared the user_registration and user_email_token_expires columns in the user table and found that the difference between them is often a week plus a little bit. That little bit is typically less than ten seconds, meaning I could apply a heuristic: if the difference exceeded a week by more than ten seconds, the email address was unlikely to have been added at registration.

As with the previous use case, this analysis would be done on a weekly basis, so sqooping the user table weekly for the two wikis would work.
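A sketch of that heuristic as a replica query; MediaWiki stores both fields as 14-character timestamp strings, so they're converted before differencing, and the one-week token lifetime and ten-second threshold come straight from the description above.

```python
# Heuristic from the comment above: if user_email_token_expires is more than
# a week plus ten seconds after user_registration, the email address was
# probably added after registration. Both columns are 14-character MediaWiki
# timestamp strings.
added_later_query = """
SELECT user_id, extra_seconds
FROM (
    SELECT user_id,
           TIMESTAMPDIFF(
               SECOND,
               STR_TO_DATE(user_registration,        '%Y%m%d%H%i%S'),
               STR_TO_DATE(user_email_token_expires, '%Y%m%d%H%i%S')
           ) - 7 * 24 * 3600 AS extra_seconds
    FROM user
    WHERE user_email_token_expires IS NOT NULL
      AND user_registration IS NOT NULL
) AS diffs
WHERE extra_seconds > 10
"""
```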

Here're some use cases from my work for the iOS app team:

  • Of course, as @Neil_P._Quinn_WMF mentioned in T161149, I will need change_tag and change_tag_def to figure out which edits are made through the iOS app.
  • A lot of the edits through the app are anonymous. To get a rough idea about how many anonymous editors we have, I need to count the unique user agent + IP combinations using the [[ https://www.mediawiki.org/wiki/Extension:CheckUser/cu_changes_table | cu_changes ]] table (see the sketch below). We also use the user agent in the cu_changes table to figure out how many edits are done on iPhone vs iPad.
  • For Wikidata short description edits made through the app, the revision_comment_temp and comment tables are needed to figure out which language these edits are in.

The iOS app has a new release once a month on average, sometimes more often. Given that the team is going to release more mobile editing features in Q3 & Q4 and next fiscal year, I will probably need to run these queries after every release, and also for the monthly board meetings to show the impact.
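A sketch of the anonymous-editor count from the cu_changes bullet above; the column names follow the CheckUser table documentation linked there, while the tag name and timestamp cutoff are placeholders.

```python
# Approximate count of distinct anonymous iOS editors: distinct user agent +
# IP pairs among tagged app edits. cu_changes columns follow the CheckUser
# docs; the tag name and date cutoff are placeholders.
anon_ios_editors_query = """
SELECT COUNT(DISTINCT CONCAT(cu_agent, '|', cu_ip)) AS approx_anon_editors
FROM cu_changes
JOIN change_tag     ON ct_rev_id = cu_this_oldid
JOIN change_tag_def ON ctd_id = ct_tag_id
WHERE ctd_name = 'ios app edit'          -- placeholder tag name
  AND cu_anon = 1
  AND cu_timestamp >= '20190101000000'   -- placeholder cutoff
"""
```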

@chelsyx FYI, the cu_changes table is in wmf_raw as mediawiki_private_cu_changes; this data is used for this dataset: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors

Thanks @Nuria. I'm aware that mediawiki_private_cu_changes is in wmf_raw, but to my understanding it is sqooped there at the beginning of every month, right?

Milimetric raised the priority of this task from Medium to High.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Ottomata added a subscriber: Ottomata.

> The ultimate purpose of collecting this data is to personalize new users' experiences based on their background and interests (hence the name "personalized first day" 😁), so the data will need to be accessible to MediaWiki. That's why they're using user_properties rather than EventLogging.

> I do not remember the specific discussions for why it was done this way, but the team is currently looking at how to use the survey results to tailor the new user experience, making that a likely explanation for why it's stored there. @SBisson is the main engineer who worked on it, I'm sure he can pop in and confirm that and/or let us know if there are other reasons.

> Thank you both for pointing this out. That makes sense, but I am interested in what @SBisson thinks. There's always the possibility of running the survey with EventLogging and then updating user_properties separately.

We considered 3 storage options for the Welcome survey: 1) EventLogging, 2) a user preference, 3) custom table(s) in the MediaWiki database.

While it is currently called a "survey", this data is part of the user's profile. We plan to use it to customize what we present to the user on-wiki and probably allow them to update it. For those reasons we thought options 2 and 3 were appropriate, and we went with 2 for simplicity.

We could definitely send the data to both EventLogging (for analysis) and user preference (for reuse within MW) if it helps.

> Here're some use cases from my work for the iOS app team:

>   • Of course, as @Neil_P._Quinn_WMF mentioned in T161149, I will need change_tag and change_tag_def to figure out which edits are made through the iOS app.
>   • For Wikidata short description edits made through the app, the revision_comment_temp and comment tables are needed to figure out which language these edits are in.

From my side, Android metrics are calculated with the same data. Change tag data is the most important for us to have available in the Data Lake.

Furthermore, right now we're tracking the prevalence of Wikimedia Maps (on wikis where that product is enabled) on a daily basis: http://discovery-beta.wmflabs.org/maps/#kartographer_prevalence

[Screenshot: Screen Shot 2019-01-09 at 10.22.05 AM.png]

Currently, the only way to get day granularity is to make sure the query runs every day on the real-time data, because there's no way to backfill from the page properties table (from which these statistics are computed) as it has no timestamps. If I remember correctly, the prevalence metric has been used by executives to make decisions regarding support for Wikimedia Maps, so it's important to keep it around in some form. There are two possibilities:

  • To maintain the day granularity of the prevalence metric, we'd need something like T212386
  • If day granularity is not important (which I guess would be up to @jmatazzoni & @Catrope to decide, as last I heard they're the owners/maintainers of that product, but feel free to correct me), then we can easily switch to calculating prevalence on a monthly basis (ReportUpdater supports this, afaik) using the Data Lake snapshots, assuming page_props data is added to mediawiki_history* in some way.

Nevermind, per T170022#4800915 & T170022#4866564 I guess there's nobody actually managing Maps and RI is just doing maintenance and fixing critical bugs.

So I guess it's up to me then to make the call regarding the prevalence stats. I'm okay switching to monthly granularity and using the Data Lake if page properties are imported. Otherwise I still need a way to access the replica shards easily.
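If page props were sqooped into the Data Lake monthly, the monthly prevalence calculation could look roughly like this sketch; the wmf_raw.mediawiki_page_props table name and the Kartographer property names are assumptions.

```python
# Hypothetical monthly prevalence query, assuming page_props were sqooped
# into a wmf_raw.mediawiki_page_props table with snapshot/wiki_db partitions.
# The Kartographer page-prop names are assumptions.
prevalence = spark.sql("""
    SELECT wiki_db,
           COUNT(DISTINCT pp_page) AS pages_with_maps
    FROM wmf_raw.mediawiki_page_props
    WHERE snapshot = '2019-01'
      AND pp_propname IN ('kartographer_links', 'kartographer_frames')
    GROUP BY wiki_db
""")
prevalence.show()
```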

I was working on another ad-hoc analysis case a couple of days ago where I needed information about when a specific abuse filter was in effect. This abuse filter is hidden, meaning I couldn't access its history on-wiki, but I do have access to that information in the MW database. In this case, it was also a recently updated filter, meaning access to up-to-date information was needed.

In a similar vein, while doing analysis of what users are doing after registration, one of the questions that came up was "did their edit attempt trigger an abuse filter?" For that analysis, up-to-date information wasn't necessary as I was working with data through mid-December.

Either way, access to information about the restrictions a given wiki had in effect and how they affected users seems key. Page protections and blocks are already somewhat present; I'm not sure if abuse filters are.

nshahquinn-wmf renamed this task from "Support all Product Analytics data needs in the Data Lake" to "Provide feature parity between the wiki replicas and the Analytics Data Lake". Jan 14 2019, 6:41 AM

It seems we have collected enough use cases already to facilitate the present discussion, but to briefly sketch another current example:

Generating and tracking the metrics required for the web team's Advanced Mobile Contributions project (T210660) will involve the following:

  • accessing change tags (the existing mobile web edit tag and a newly created advanced mobile edit tag); a sketch of the tag counting follows below
  • accessing the logging table (some of the "moderation actions" in this project are defined/counted via the logs)
  • aggregating across 15 target wikis (cf. T212386)

Having to rely on a monthly snapshot for testing the queries initially (when the new tag is being deployed) and for evaluating the effect of ongoing product changes would introduce considerable delays.
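The tag counting mentioned in the first bullet boils down to something like the query below, run per wiki against the replicas (or against sqooped copies) and then summed across the 15 target wikis; the tag names are the ones mentioned above.

```python
# Per-wiki count of AMC-relevant tagged edits, to be run against each of the
# 15 target wikis and then aggregated. Tag names are the ones mentioned in
# the comment above.
tagged_edits_query = """
SELECT ctd_name AS tag, COUNT(*) AS edits
FROM change_tag
JOIN change_tag_def ON ctd_id = ct_tag_id
WHERE ctd_name IN ('mobile web edit', 'advanced mobile edit')
  AND ct_rev_id IS NOT NULL
GROUP BY ctd_name
"""
```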

Quick note to thank everyone very much for all these use cases. They're very useful for both the short and long term planning that's always spinning in my brain. Thank you!

> Quick note to thank everyone very much for all these use cases. They're very useful for both the short and long term planning that's always spinning in my brain. Thank you!

And thank you for the excellent discussion and for always being sympathetic to our needs (even when it's not possible to meet them all)! 😁

mforns lowered the priority of this task from High to Medium.Mar 7 2019, 6:12 PM

Regarding the quick-lookups, I suggest using spark in shell mode (whether in python or in scala):

  • Extract the subset of data you're after and register it as a temporary table: spark.sql("SELECT * FROM wmf.mediawiki_history WHERE snapshot = '2019-03' AND wiki_db = 'mywiki' AND page_title = 'a title'").createOrReplaceTempView("myview")
  • Cache the view for fast access: spark.table("myview").cache()
  • Access the data as needed: spark.sql("SELECT count(1) from myview").show()

With the above solution, the first access (reading data and caching the table) takes some time (a few minutes max I'd say) then other requests to myview are subsecond.
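Putting those three steps together in a pyspark shell (wiki, page title, and snapshot are placeholders); note that cache() is lazy, so the first action is what actually pays the read-and-cache cost.

```python
# The three steps above, end to end, in a pyspark shell. Wiki, page title,
# and snapshot are placeholders.
spark.sql("""
    SELECT *
    FROM wmf.mediawiki_history
    WHERE snapshot = '2019-03'
      AND wiki_db = 'mywiki'
      AND page_title = 'a title'
""").createOrReplaceTempView("myview")

spark.table("myview").cache()
spark.table("myview").count()   # first action: reads and caches (minutes)

spark.sql("SELECT count(1) FROM myview").show()            # now subsecond
spark.sql("""
    SELECT event_type, count(1) AS n
    FROM myview
    GROUP BY event_type
""").show()
```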

> Wikipedia-to-Wikidata linkage patterns (T209891#4798717, using the page_props table)

I have a use case for this. I would like to use Wikidata to identify edits to articles that are biographies, along with metadata about the people. Perhaps with some scaffolding I will be able to write the code to add wikidata_id to mediawiki_history and/or mediawiki_page_history.

Just an FYI in this ticket: while we will be working to fix quality bugs in mw like T230915, it is not likely we will be able to add more tables to the reconstruction.
The reconstruction has reached a level of complexity that we do not think we can increase if we want to maintain a reliable delivery schedule for the data. Going forward, our efforts will probably revolve more around how to use event data to enrich the dataset.

Now, table sqooping will continue happening, which means that we will have more MediaWiki tables in Hadoop, easing querying considerably. See, for example, a new change in this regard: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/553405/

The main point of this task was to explain the data access needs of the Product Analytics team in the context of T172410. That migration is long complete, and things have been working well for us. Actually providing feature parity between the wiki replicas and the Analytics Data Lake, as this task literally states, is not a feasible goal.