Page MenuHomePhabricator

Run ETL for wmf_raw.ActionApi into wmf.action_* aggregate tables
Open, MediumPublic

Description

In T116065: Design aggregate tables to drive Action API reports I designed a series of dimensional rollup tables to make reporting on various interesting Action API metrics easier. I have been running an ETL process using cron, bash, Python, and hive on stat1002 since April 2016 that populates tables for that schema in the bd808 database. This workflow needs to be converted to use standard Analytics tools and move to the wmf database.

Note: now running on stat1005 due to stat1002 decomm.

Related Objects

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald Transcript

Moving to radar for now, but when you prioritize and define this, we can help code the Oozie jobs that would get this done. Just let us konw.

Moving to radar for now, but when you prioritize and define this, we can help code the Oozie jobs that would get this done. Just let us konw.

I can make time to work on it whenever someone has time to help me. I can probably knock out most of it with just a few pointers to wiki pages and/or gerrit commits that show doing something similar. The tasks that get run are really just /usr/bin/hive with proper year, month, day, hour values supplied as cli args.

oh, sweet, then it should be pretty straightforward. Grab me in IRC or a hangout whenever we're both free. As far as examples, all of our oozie code is here (but I will help you navigate it):

https://github.com/wikimedia/analytics-refinery/tree/master/oozie

And all of our oozie documentation is here (warning: XML is present in concentrations that may burn your eyes):

https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Oozie

bd808 triaged this task as Medium priority.Jul 22 2016, 5:05 PM

@Tgr do you have the time and energy to take this task over and finally finish our team goal from Q3 2015/16?

I think I have successfully conned convinced @Tgr to take this on when he gets some time.

Same offer to help with oozie applies to @Tgr, of course, and bonus: I'm now a lot better at setting up oozie jobs.

Change 331100 had a related patch set uploaded (by Gergő Tisza):
[WIP] Add Oozie jobs for wmf_raw.ApiAction -> wmf.action_*

https://gerrit.wikimedia.org/r/331100

@Tgr I'm aiming to review this by the end of the week, please ping me if I slip up.

Addshore changed the task status from Open to Stalled.Aug 28 2018, 7:34 AM
Addshore subscribed.

Just giving this a poke 1 year on as it blocked T174474: ApiAction log in data lake doesn't record Wikibase API actions what's the status here?

Marking as stalled until this is updated.

ping @Anomie or @Tgr: were either of you interested in finishing this change https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/331100/ or should I consider taking it over when I have some time?

Feel free as far as I'm concerned. I don't know anything about Oozie.

To be fair I do not think we will have any time to work on this soon and I am really of the opinion that it would be best if the api team owns the instrumentation code that instruments the api, that way the code will be properly maintained and changes on api code will be propagated to changes in instrumentation. We are happy to help api developers to push the changes through as needed be.

ping @Anomie or @Tgr: were either of you interested in finishing this change https://gerrit.wikimedia.org/r/#/c/analytics/refinery/+/331100/

Not anytime soon; I'm unfamiliar enough with Hive and Oozie that even small changes take very long, and don't have the time for it right now.

Milimetric raised the priority of this task from Medium to Unbreak Now!.Oct 1 2018, 2:47 PM
Milimetric moved this task from Radar to Incoming on the Analytics board.
Milimetric lowered the priority of this task from Unbreak Now! to Medium.Oct 1 2018, 2:48 PM
Ottomata raised the priority of this task from Medium to Needs Triage.Oct 4 2018, 5:16 PM
Ottomata moved this task from Incoming to Radar on the Analytics board.
Milimetric triaged this task as Medium priority.Oct 4 2018, 5:18 PM
EvanProdromou subscribed.

I'm going to take over this ticket until I figure out what's going on and get it routed to the right place.

@EvanProdromou I'm not sure about the details, but you'll want to look at T222267 as well. Data that you need may now be in event.mediawiki_api_request in a better shape? In any case, if you still need to aggregate / package this using oozie, I never had time to take this over like I thought in T137321#4548881 but I'm happy to walk you through it.

Aklapper changed the task status from Stalled to Open.Nov 4 2020, 10:45 PM

The previous comments don't explain who or what (task?) exactly this task is stalled on ("If a report is waiting for further input (e.g. from its reporter or a third party) and can currently not be acted on"). Hence resetting task status, as tasks should not be stalled (and then potentially forgotten) for years for unclear reasons.

(Smallprint, as general orientation for task management:
If you wanted to express that nobody is currently working on this task, then the assignee should be removed and/or priority could be lowered instead.
If work on this task is blocked by another task, then that other task should be added via Edit Related Tasks...Edit Subtasks.
If this task is stalled on an upstream project, then the Upstream tag should be added.
If this task requires info from the task reporter, then there should be instructions which info is needed.
If this task needs retesting, then the TestMe tag should be added.
If this task is out of scope and nobody should ever work on this, or nobody else managed to reproduce the situation described here, then it should have the "Declined" status.
If the task is valid but should not appear on some team's workboard, then the team project tag should be removed while the task has another active project tag.)

It's basically stalled on resourcing. There was recent interest from @Maryana on the data, so perhaps they have a concrete way forward complete with resourcing. As before, I'm here to help and walk anyone through productionizing this job.

I could probably pick this up as a 10%-ish exercise, if someone (@Milimetric?) can spare some code review cycles.

Change 331100 had a related patch set uploaded (by Mholloway; owner: Gergő Tisza):
[analytics/refinery@master] [WIP] Add Oozie jobs for mediawiki_api_request rollup tables

https://gerrit.wikimedia.org/r/331100

I picked up @Tgr's old patch and updated it to consume data from event.mediawiki_api_request, along with some other cosmetic changes. The HQL scripts seem to work well as updated. I also kicked off an Oozie job (using mholloway as the target database) but got cold feet and killed it.

I'm not sure if that patch is the direction we want to keep going in or if (as @Milimetric suggested in Gerrit) we want to follow a more generalized approach to creating data sets like these for distribution, but in any case it was useful in getting to know this stuff.

Ottomata subscribed.

Not related to Event Platform. Removing tag.