
Load change tags into the Analytics Data Lake on a daily basis
Closed, Resolved (Public)

Description

The Product Analytics team has been trying to figure out how we can calculate our annual plan metrics in near-real-time using the Analytics Data Lake. We could use the MariaDB analytics replicas, but we're reluctant to invest time in a system that will be deprecated soon (T172410).

It seems like we can use the EventBus logs (e.g. mediawiki_revision_create) for real-time-ish edit data, but that data still doesn't include edit tags. The ideal would be for those logs to include edit tags natively, but to make things simpler, can we just have the change_tag tables loaded into the Data Lake daily as a separate table?

This means we analysts would take care of the details of joining it to other data sources and adapting to the upcoming schema changes (T185355) when they happen, which hopefully makes this easy to accomplish.

Theoretically, a revision's change tags can be changed by users at any time, but the tags we're interested in are software-set, which means we can rely on them to be set initially and not change afterwards. So it would be sufficient to append each day's new rows rather than reloading the entire table every day, except that the schema changes will require a complete reload when they occur.

The schema can be similar to the MediaWiki tables in wmf_raw: identical to the original schema except for an additional wikidb field.
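For illustration, a minimal sketch of what such a Hive table definition might look like, modeled on the other wmf_raw tables (the table name, partition layout, and Avro storage are assumptions; the columns mirror the pre-T185355 change_tag schema):

```sql
-- Sketch of a sqooped change_tag table in wmf_raw; names are assumptions.
CREATE TABLE IF NOT EXISTS wmf_raw.mediawiki_change_tag (
    ct_rc_id  BIGINT COMMENT 'recentchanges row this tag applies to',
    ct_log_id BIGINT COMMENT 'logging row this tag applies to',
    ct_rev_id BIGINT COMMENT 'revision this tag applies to',
    ct_tag    STRING COMMENT 'tag name, e.g. "mobile edit"',
    ct_params STRING COMMENT 'tag parameters (usually NULL)'
)
PARTITIONED BY (
    snapshot STRING COMMENT 'import snapshot, e.g. "2018-08"',
    wiki_db  STRING COMMENT 'source wiki, e.g. "enwiki"'
)
STORED AS AVRO;
```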

Setting up a workflow for calculating our annual plan metrics is an important priority for us, so it would be extremely helpful if this could be done within the next 2 weeks (by August 24).

Note that T161149 is a separate, still-open request; this is a stopgap until (1) change tags are integrated in mediawiki_history and (2) mediawiki_history is updated on a closer to real-time basis.

Event Timeline

A related ask (though it should probably sit in a new ticket): is it possible to backfill the tags? We started including the ios app edit and android app edit tags on June 29 (T194424), but it would be great if we could backfill these tags for older revisions.

nshahquinn-wmf renamed this task from "Load change_tag tables in Analytics Data Lake daily" to "Load change_tag tables into the Analytics Data Lake on a daily basis". (Aug 2 2018, 7:22 PM)
nshahquinn-wmf updated the task description.

Message received that this is high priority and time-sensitive. @Neil_P._Quinn_WMF, the proposal to sqoop daily on an incremental basis is OK, but we may have to do it ad hoc. Our current scripts just sqoop everything, so it's not as simple as adding another table to the list. But it's not complicated either.

@chelsyx I see ios app edit and android app edit in the change_tag table, so by backfill, do you mean just importing all of change_tag at first? That's what we would do, unless you mean there's some other source of tags that aren't in that MediaWiki table as of today.

@Milimetric I mean the latter. I'm wondering if we save the original user agent strings of each revision somewhere else, so that we can backfill the ios app edit and android app edit tags for revisions that only have the mobile app edit tag in the change_tag table.

I see, @chelsyx. We only save the user agent string in two places: the recentchanges and cu_changes tables. Those are sqooped and processed for the new geoeditors dataset, so there's some parsed user agent data per user-id / user-ip. But all of this is deleted after 90 days, so we don't have data going back very far. We would need to make a job that would look at these two datasets together and convert the tags if necessary. That could be done as part of the sqoop. I'll add that code here, and you can review it if I forget.
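A very rough sketch of what that conversion job might look like, using the user agent saved in the sqooped cu_changes data (only available for the last ~90 days). The user-agent patterns and the join through cuc_this_oldid are assumptions for illustration, not the actual job:

```sql
-- Hypothetical tag conversion: split "mobile app edit" into the iOS/Android
-- variants based on the saved user agent. UA patterns are illustrative only.
SELECT
    ct.ct_rev_id,
    CASE
        WHEN lower(cu.cuc_agent) LIKE '%ios%'     THEN 'ios app edit'
        WHEN lower(cu.cuc_agent) LIKE '%android%' THEN 'android app edit'
        ELSE ct.ct_tag
    END AS ct_tag
FROM change_tag ct
JOIN cu_changes cu
    ON cu.cuc_this_oldid = ct.ct_rev_id  -- UA row for the same revision
WHERE ct.ct_tag = 'mobile app edit';
```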


If it's done as part of the sqoop, I assume the resulting data would only be present in that version of the sqooped table, so, for example, it would disappear when we have to reload the entire table to account for the schema change. That would still let us use it for a baseline now, but ideally, if we go to the trouble of backfilling, we could save those tags permanently in the canonical tables.

The action API does have a module for altering change tags, so this wouldn't necessarily need to be done through a database script.

@chelsyx, how valuable would permanent storage of the backfilled tags be?

We talked about this today, and here's our plan:

  • Daily incremental sqoops are too hard to squeeze in before the 24th without derailing our other goals for the quarter
  • We will add a monthly sqoop of change_tag so you have it in wmf_raw
  • You can do the analyses you need on top of that monthly load
  • In the meantime, once we're in better shape with our other goals work, we'll figure out how to do the incremental daily sqooping (if it's OK with the DBAs and we don't run into weird data problems)
  • We'll add another task to do the change tag transformation that Chelsy mentioned (mobile app edit -> ios/android app edit)
  • We'll add another task to integrate the change tags into the large denormalized table

Some questions:

  • wouldn't we also have to bring in revisions/archive daily? Otherwise you have nothing to join to. We can't do this accurately with sqoop, because the revision table has inserts in the past, as you saw when you were looking at the mediawiki_revision_create EventBus data
  • isn't there a change tag redesign underway? Do you know how it would affect this?

Another option you have is to work with someone like Gergo to add the change tags to the mediawiki_revision_create event. We don't own that, and any MediaWiki developer should be able to do it faster than we can, if they have more time available before the 24th.

Or another option is to do something ad hoc, like I mentioned before: we could run an incremental sqoop into some separate table and run the analysis on top of that. But I'm rethinking this; it seems too dangerous. If it fails in some subtle way and we're not giving it the attention it needs, the resulting data could mislead a lot of people.

Another option you have is to work with someone like Gergo to add the change tags to the mediawiki_revision_create event. We don't own that, and any MediaWiki developer should be able to do it faster than we can, if they have more time available before the 24th.

Adding tags (if they are available to the event... they probably are) would be pretty easy. But for completeness, and since, as you say, the tags can change later on, perhaps we could make a new event here? mediawiki.revision-tag-change?

Okay, that's quite disappointing. I think we need to discuss this more in our team meeting today, but some initial comments:

  • In the meantime, once we're in better shape with our other goals work, we'll figure out how to do the incremental daily sqooping (if it's OK with the DBAs and we don't run into weird data problems)

What kind of timeframe are we talking about here? End of August? End of October? End of December?

  • We'll add another task to do the change tag transformation that Chelsy mentioned (mobile app edit -> ios/android app edit)

That's a nice-to-have, so please don't focus on it to the detriment of the main goal: getting a daily flow of change tags into the data lake.

Some questions:

  • wouldn't we also have to bring in revisions/archive daily? Otherwise you have nothing to join to. We can't do this accurately with sqoop, because the revision table has inserts in the past, as you saw when you were looking at the mediawiki_revision_create EventBus data

Is there a reason we can't join the sqooped table to the mediawiki_revision_create logs?

  • isn't there a change tag redesign underway? Do you know how it would affect this?

As I wrote in the description, I'm proposing you just sqoop the table as-is, so when the change happens, you'd do a one-time full re-sqoop and let us deal with updating our queries to match the new format. That was part of my attempt to make this request as simple as possible, to maximize the chances it could be done soon. If you have other suggestions for simplifications, please make them!

Another option you have is to work with someone like Gergo to add the change tags to the mediawiki_revision_create event. We don't own that, and any MediaWiki developer should be able to do it faster than we can, if they have more time available before the 24th.

@Tgr, do you have a sense of how quickly you could do something like this?

Or another option is to do something ad hoc, like I mentioned before: we could run an incremental sqoop into some separate table and run the analysis on top of that. But I'm rethinking this; it seems too dangerous. If it fails in some subtle way and we're not giving it the attention it needs, the resulting data could mislead a lot of people.

What do you mean by "some separate table"? What I proposed here was a sqoop into a table separate from mediawiki_history or mediawiki_revision_create; is there a level more separate than that? Also, we have to weigh the danger of data quality issues against the danger of having no data (or only inaccessible or outdated data) to guide our work...

Change tags are added after the revision has been created and saved (and they can be added or removed much later, although that probably does not really happen in practice). So that would be some major refactoring, if at all desirable.

What's easily doable (an hour to write, probably a week to deploy?) is to create a new change_tag event. Probably just implement the ChangeTagsAfterUpdateTags hook (and maybe ChangeTagAfterDelete, if it's worth the effort) in EventBus and trigger the event(s). (Not sure if something else needs to be done on the changeprop side, @mobrovac or @Pchelolo could tell.)

You could then join the change_tag events to the revision by revid, and would have to aggregate the effects manually. (Or just join, merge the added tags, and assume that tags are never removed, which is probably close enough.)
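For concreteness, a rough sketch of the "join and merge the added tags" approach in Hive, assuming the events are refined into tables named like the existing EventBus tables (the tags-change table and its tags array field are assumptions based on the schema discussed later in this task):

```sql
-- Sketch: attach tags to revisions by joining events on rev_id, ignoring
-- tag removals (close enough for software-set tags, per the discussion).
SELECT
    rc.rev_id,
    rc.rev_timestamp,
    collect_set(tc.tag) AS tags  -- merge every tag ever added to the revision
FROM event.mediawiki_revision_create rc
LEFT JOIN (
    SELECT rev_id, tag
    FROM event.mediawiki_revision_tags_change
    LATERAL VIEW explode(tags) t AS tag  -- one row per (revision, tag)
) tc ON tc.rev_id = rc.rev_id
GROUP BY rc.rev_id, rc.rev_timestamp;
```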

Not sure if something else needs to be done on the changeprop side

change-prop wouldn't need to be involved, unless the event needs to trigger something outside of MediaWiki to happen. I tend to agree that adding a new revision-tag-change (name TBD) event in EventBus would be easiest. Then you'd get the events populated about hourly just like mediawiki_revision_create.

+1 on creating such an event, it sounds like a useful piece of information for clients to have/be able to react to.

Not sure if something else needs to be done on the changeprop side

change-prop wouldn't need to be involved, unless the event needs to trigger something outside of MediaWiki to happen.

That is correct, but ...

I tend to agree that adding a new revision-tag-change (name TBD) event in EventBus would be easiest. Then you'd get the events populated about hourly just like mediawiki_revision_create.

... how would this event be consumed, then? Given the low volume, wouldn't the easiest thing be to instruct change-prop to fire an HTTP request on each tag change?

Ah no, this would just be produced into Kafka via EventBus. We automatically consume and refine all of the MediaWiki EventBus events into Hive.

  • In the meantime, once we're in better shape with our other goals work, we'll figure out how to do the incremental daily sqooping (if it's OK with the DBAs and we don't run into weird data problems)

What kind of timeframe are we talking about here? End of August? End of October? End of December?

After the above discussion and thinking some more, I agree it makes the most sense to just add a new event and get the data that way. We can sqoop the change tags as-is monthly and use the events for up-to-date reporting. I think this can happen pretty quickly, depending on making the schema for the new event and instrumenting one of the hooks Gergo mentioned.
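A sketch of how that hybrid might look to an analyst: the monthly snapshot as the baseline, plus tag-change events for the period since the snapshot cutoff. Table/field names and the cutoff date are assumptions:

```sql
-- Sketch: combine the monthly change_tag snapshot with tag-change events.
SELECT DISTINCT rev_id, tag
FROM (
    -- baseline: latest monthly snapshot of the sqooped table
    SELECT ct_rev_id AS rev_id, ct_tag AS tag
    FROM wmf_raw.mediawiki_change_tag
    WHERE snapshot = '2018-08' AND wiki_db = 'enwiki'

    UNION ALL

    -- delta: tags added by events since the snapshot cutoff
    SELECT rev_id, tag
    FROM event.mediawiki_revision_tags_change
    LATERAL VIEW explode(tags) t AS tag
    WHERE `database` = 'enwiki'
      AND meta.dt >= '2018-09-01'
) all_tags;
```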

  • We'll add another task to do the change tag transformation that Chelsy mentioned (mobile app edit -> ios/android app edit)

That's a nice-to-have, so please don't focus on it to the detriment of the main goal: getting a daily flow of change tags into the data lake.

Oh yeah, that would be for later.

Some questions:

  • wouldn't we also have to bring in revisions/archive daily? Otherwise you have nothing to join to. We can't do this accurately with sqoop, because the revision table has inserts in the past, as you saw when you were looking at the mediawiki_revision_create EventBus data

Is there a reason we can't join the sqooped table to the mediawiki_revision_create logs?

This is where it gets complicated, because revision, archive, and tags all change over time. All the proposals so far would have some level of inaccuracy due to those changes not propagating into the analytics DBs at the same frequency. And I'm not sure what percentage of inaccuracy is OK for your analysis.

  • isn't there a change tag redesign underway? Do you know how it would affect this?

As I wrote in the description, I'm proposing you just sqoop the table as-is, so when the change happens, you'd do a one-time full re-sqoop and let us deal with updating our queries to match the new format. That was part of my attempt to make this request as simple as possible, to maximize the chances it could be done soon. If you have other suggestions for simplifications, please make them!

OK, sounds good; so the remaining problem is changes over time and how they affect the results of analysis.

Or another option is to do something ad hoc, like I mentioned before: we could run an incremental sqoop into some separate table and run the analysis on top of that. But I'm rethinking this; it seems too dangerous. If it fails in some subtle way and we're not giving it the attention it needs, the resulting data could mislead a lot of people.

What do you mean by "some separate table"? What I proposed here was a sqoop into a table separate from mediawiki_history or mediawiki_revision_create; is there a level more separate than that? Also, we have to weigh the danger of data quality issues against the danger of having no data (or only inaccessible or outdated data) to guide our work...

Sorry, I was being unclear. I just meant an ad hoc sqoop that we could run without merging code, monitoring the job, going over the data quality, and all that. When I think about this data, I just don't have the sense that it would be stable enough if we sqooped only new data and relied on EventBus data. Here are examples of the problems I see:

  • revisions come in through EventBus on revision_create
  • pages get deleted and their revisions move to the archive table, which isn't reflected in the data
  • tags come in on deleted pages, and we join to outdated data
  • old tags change, but we don't update the results because they're old records that were already sqooped
  • pages get restored in the past; I'm not even sure what happens to the tags, but we see those as new revisions in EventBus, so we double-count them

It's just kind of a mess, and without doing a deep dive I don't have a sense of how far off the join of all those messes would push your results. Maybe we could hang out and look at what questions you need to answer. With a new change_tag_change event, maybe there's a better solution.

pages get restored in the past; I'm not even sure what happens to the tags, but we see those as new revisions in EventBus, so we double-count them

revisions belonging to restored pages emit new revision-create events when the page is restored?!?

At a glance there's just a page.delete / page.undelete event (or mediawiki.revision-visibility-change if it's revision deletion).

revisions belonging to restored pages emit new revision-create events when the page is restored?!?

Yes: T166351, and as a gateway to fixing it, T188396.

I'll make a schema proposal shortly.

I am also of the opinion that adding a revision-tag-change event is the easiest thing to do, if those tags can easily be joined with a revision (and I think @Tgr's proposal says they can). While we can add the tag table to our monthly sqoop, we do not, at this time, run any incremental updates for sqooped data; that is a harder problem than it seems at first sight, and it is unlikely we will have the bandwidth to work on it this quarter.

Change 451030 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/event-schemas@master] Create mediawiki-revision-tags-change event schema.

https://gerrit.wikimedia.org/r/451030

The proposed schema falls into the revision hierarchy, so it will inherit from the revision-create schema in order to follow our guidelines. Apart from that, it will contain the new list of tags and a prior_state object with the old list of tags, again because that's how we do it in other *-change events.

Change 451056 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/EventBus@master] Emit mediawiki.revision-tags-change event.

https://gerrit.wikimedia.org/r/451056

nshahquinn-wmf renamed this task from "Load change_tag tables into the Analytics Data Lake on a daily basis" to "Load change tags into the Analytics Data Lake on a daily basis". (Aug 7 2018, 6:06 PM)

Change 451030 merged by jenkins-bot:
[mediawiki/event-schemas@master] Create mediawiki-revision-tags-change event schema.

https://gerrit.wikimedia.org/r/451030

Change 451056 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Emit mediawiki.revision-tags-change event.

https://gerrit.wikimedia.org/r/451056

Thanks @Pchelolo! This will be super helpful for us.

And thanks @Milimetric as well for the quick reply and clear feedback. They're mostly moot now, but I'll reply to some of your concerns from earlier just for clarification.

  • pages get deleted and their revisions move to the archive table, which isn't reflected in the data
  • tags come in on deleted pages, and we join to outdated data

In pretty much all our analysis, we union the revision and archive tables, because we generally want to include edits to deleted pages and because we don't want the results of a specific query to change later based on a deletion decision. So we don't actually want to account for pages getting deleted! And since the change_tag table isn't affected by page deletion or undeletion, it wouldn't matter when we sqooped it.
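That union-then-join pattern might look something like the following sketch (table names follow the wmf_raw conventions discussed above, but treat them as assumptions):

```sql
-- Sketch: union revision and archive so edits to deleted pages still count,
-- then attach tags from the sqooped change_tag table.
WITH all_revisions AS (
    SELECT rev_id, rev_timestamp
    FROM wmf_raw.mediawiki_revision
    WHERE snapshot = '2018-08' AND wiki_db = 'enwiki'
    UNION ALL
    SELECT ar_rev_id AS rev_id, ar_timestamp AS rev_timestamp
    FROM wmf_raw.mediawiki_archive
    WHERE snapshot = '2018-08' AND wiki_db = 'enwiki'
)
SELECT r.rev_id, r.rev_timestamp, ct.ct_tag
FROM all_revisions r
LEFT JOIN wmf_raw.mediawiki_change_tag ct
    ON  ct.ct_rev_id = r.rev_id
    AND ct.snapshot  = '2018-08'
    AND ct.wiki_db   = 'enwiki';
```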

  • old tags change, but we don't update the results because they're old records that were already sqooped

Well, the only tags we really care about right now are the main software-set tags (like mobile edit, as opposed to AbuseFilter-defined ones like repeating characters added). So for those, any manual additions and removals (which I imagine are very rare) are not just unimportant—they're almost certainly incorrect.

  • pages get restored in the past; I'm not even sure what happens to the tags, but we see those as new revisions in EventBus, so we double-count them

The event timestamp is set to the time of restoration, which is a bit confusing, but now that I understand that (thanks to your email from earlier!), I can just use the rev_timestamp field, which accurately reflects the original edit time, so we won't be bothered by these restorations.
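In query terms, deduplicating on rev_id and bucketing by rev_timestamp sidesteps the duplicate revision-create events that restorations emit (field names as in the revision-create schema; the date handling assumes ISO 8601 timestamps):

```sql
-- Sketch: count edits by the original edit time, counting each revision once
-- even if page restoration re-emitted its revision-create event (T166351).
SELECT
    substr(rev_timestamp, 1, 10) AS edit_date,  -- 'YYYY-MM-DD' prefix
    COUNT(DISTINCT rev_id)       AS edits
FROM event.mediawiki_revision_create
WHERE `database` = 'enwiki'
GROUP BY substr(rev_timestamp, 1, 10);
```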

Will wait for the 21st deploy to confirm the data appears in the event database on Hive.

After today's deploy of group 0, I've tested some tag adding/removal on a test wiki, and it seems the change works. I can prepare a change to expose the event via EventStreams; do we want that?

I can prepare a change to expose the event via EventStreams; do we want that?

I do not think so; the task was to have the event internally, and there is no need to expose it externally.

Yeah, let's wait. Maybe we'll want that, but let's take a bit to make sure everything is as we want it, and look at the data to decide whether it's something that would be interesting for public consumption.

This data has been accumulating successfully for a month now (for example, see P7612). Thanks, everyone!