
Release edit data lake data as a public json dump /mysql dump, other?
Open, HighPublic

Description

Release edit data lake data publicly

Event Timeline

Nuria created this task.Nov 2 2018, 5:52 PM
Restricted Application added a subscriber: Aklapper.Nov 2 2018, 5:52 PM
Nuria added a comment.Nov 2 2018, 6:08 PM

We think the research community can benefit from the edit data lake data in the form of a somewhat large text dump containing JSON.

This dump will contain edit data, denormalized for easy analytics calculations:

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_user_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Metrics
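A dump like the one described above would typically be consumed as newline-delimited JSON. A minimal sketch of reading such a file, with purely illustrative field names (the actual Mediawiki_history schema is documented on the wikitech pages linked above):

```python
import json

# Hypothetical sample lines; the field names here are illustrative only,
# not the actual Mediawiki_history schema.
sample_dump = """\
{"event_entity": "revision", "event_timestamp": "2018-11-02T17:52:00Z", "page_title": "Example"}
{"event_entity": "revision", "event_timestamp": "2018-11-02T18:08:00Z", "page_title": "Example"}
"""

def count_events(lines):
    """Count events per entity type in a newline-delimited JSON dump."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        entity = event.get("event_entity", "unknown")
        counts[entity] = counts.get(entity, 0) + 1
    return counts

print(count_events(sample_dump.splitlines()))  # {'revision': 2}
```

In practice one would stream a (possibly compressed) file line by line rather than load it whole, which is the main advantage of a line-oriented text format for a large dump.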

Adding some subscribers who can provide input on whether this is a good idea or not.

Nuria added subscribers: Halfak, Nemo_bis.
fdans triaged this task as High priority.Nov 5 2018, 5:27 PM
fdans lowered the priority of this task from High to Normal.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
fdans added a subscriber: fdans.

Let's make sure to collect use cases for this and talk to research

leila added a subscriber: leila.May 21 2019, 8:12 PM
Nuria renamed this task from Release edit data lake data as a public json dump to Release edit data lake data as a public json dump /mysql dump, other?.May 28 2019, 9:18 AM
Nuria updated the task description.
leila edited projects, added Research-Backlog; removed Research.Jul 11 2019, 4:13 PM
Ottomata raised the priority of this task from Normal to High.
Ottomata added a project: Analytics-Kanban.

@leila / @nettrom_WMF: fyi I'm working on this now. I've started a draft page where I'm thinking out loud about how to publish: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Public

Any thoughts / use cases / wishes are welcome.

Milimetric reassigned this task from Milimetric to mforns.Jul 25 2019, 4:49 PM
Milimetric added a subscriber: Milimetric.
mforns moved this task from In Progress to Paused on the Analytics-Kanban board.Jul 29 2019, 2:52 PM
leila added a comment.Jul 29 2019, 7:12 PM

@Milimetric nice to see that we're here. :) I did one pass over the wikitech page. What kind of input is most helpful for you?

Nuria added a comment.Jul 29 2019, 7:16 PM

@leila Best would be use cases and how you would expect to use this data, which will inform our ideas on how to release it.

leila added a comment.Jul 29 2019, 7:51 PM

@Milimetric @Nuria: Ok. And by when do you need this input?

Nuria added a comment.Jul 29 2019, 7:52 PM

@leila we are working on this for the next couple of weeks so the sooner the better

@leila: we can of course iterate on the format in the future. Eventually we'll have a public API to query the whole dataset. But for now we just want some idea of common / high priority use cases that we can try to serve with a simpler release. Thank you so much for looking into it.

mforns moved this task from Paused to In Progress on the Analytics-Kanban board.Aug 2 2019, 7:01 PM
Milimetric added a comment.EditedAug 2 2019, 7:44 PM

Rough draft of a blurb about why this dataset is useful:

NOTE: A history of activity on Wikimedia projects as complete and research-friendly as possible. We add context to edits, such as whether they were reverted, when they were reverted, how many bytes they changed, how many edits had the user made at that time, and much more, all in the same row as the edit itself. So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.

(updated note per @nettrom_WMF's suggestions below)

Change 528504 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add spark job to create mediawiki history dumps

https://gerrit.wikimedia.org/r/528504

+1 to @leila's rough draft. Could we add something to the last sentence to emphasize that these datasets remove the need for additional processing, to make the point about adding context stronger? E.g.:

So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.

One thing I was thinking about with regard to the file format of the dumps is that a lot of the Wikipedia research I've seen that studies multiple wikis selects them based on their number of articles. I suspect that largely correlates with amount of activity, meaning that if we group wikis by number of articles rather than number of events, the list will look mostly the same.

One thing we'd definitely want, though, is some place to look up where to find a given wiki, so that those who want to use these datasets can easily figure out which files to get based on which wiki they're studying.
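The lookup suggested above could be as simple as an index mapping each wiki database name to the dump file(s) that contain it. A sketch under that assumption (file names and grouping are invented for illustration):

```python
# Hypothetical index mapping each wiki to the dump file that contains it;
# the file names and size-based grouping are illustrative, not the actual
# published layout.
wiki_index = {
    "enwiki": "group-large-0.json.gz",
    "frwiki": "group-medium-0.json.gz",
    "eowiki": "group-small-3.json.gz",
}

def files_for(wikis):
    """Return the set of dump files needed to cover the given wikis."""
    missing = [w for w in wikis if w not in wiki_index]
    if missing:
        raise KeyError(f"no dump file listed for: {missing}")
    return {wiki_index[w] for w in wikis}

print(files_for(["enwiki", "eowiki"]))
```

Publishing such an index alongside the dumps would let researchers download only the files covering the wikis they study, regardless of how the wikis are grouped into files.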

Change 530002 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add Oozie job for mediawiki history dumps

https://gerrit.wikimedia.org/r/530002

mforns moved this task from In Progress to Paused on the Analytics-Kanban board.Fri, Aug 23, 3:47 PM
mforns moved this task from Paused to In Progress on the Analytics-Kanban board.Thu, Sep 12, 5:11 PM

See the final format of the dumps, chosen after the community survey, here: T224459#5491080

Change 528504 merged by jenkins-bot:
[analytics/refinery/source@master] Add spark job to create mediawiki history dumps

https://gerrit.wikimedia.org/r/528504

Change 530002 merged by Joal:
[analytics/refinery@master] Add Oozie job for mediawiki history dumps

https://gerrit.wikimedia.org/r/530002