
Release edit data lake data as a public json dump /mysql dump, other?
Closed, Resolved · Public · 8 Estimated Story Points

Description

Release edit data lake data publicly.

Event Timeline

We think the research community can benefit from the edit data lake data in the form of a somewhat large text dump that contains JSON.

This dump will contain edit data denormalized for easy analytics calculations.

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_user_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Metrics
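
As a sketch of how researchers might consume such a dump (the path, snapshot layout, and on-disk format below are assumptions made only for illustration; field names follow the schema pages linked above):

```python
# Hypothetical sketch: reading a denormalized mediawiki history dump with Spark.
# Field names follow the wikitech schema pages linked above; the on-disk format
# (Parquet vs. TSV, partitioning) is an assumption until the release format is decided.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mwh-example").getOrCreate()

# Path and snapshot layout are hypothetical.
history = spark.read.parquet("mediawiki_history/snapshot=2019-06/")

# Monthly edit counts per wiki, using only the denormalized table (no joins needed).
(history
    .filter(F.col("event_entity") == "revision")
    .filter(F.col("event_type") == "create")
    .groupBy("wiki_db", F.date_format("event_timestamp", "yyyy-MM").alias("month"))
    .count()
    .orderBy("wiki_db", "month")
    .show())
```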

Adding some subscribers who can provide input on whether this is a good idea.

fdans lowered the priority of this task from High to Medium.
fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
fdans subscribed.

Let's make sure to collect use cases for this and talk to Research.

Nuria renamed this task from Release edit data lake data as a public json dump to Release edit data lake data as a public json dump /mysql dump, other?.May 28 2019, 9:18 AM
Nuria updated the task description.
Ottomata raised the priority of this task from Medium to High.
Ottomata added a project: Analytics-Kanban.

@leila / @nettrom_WMF: fyi I'm working on this now. I've started a draft page where I'm thinking out loud about how to publish: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Public

Any thoughts / use cases / wishes are welcome.

@Milimetric nice to see that we're here. :) I did one pass over the wikitech page. What kind of input is most helpful for you?

@leila Best would be use cases and how you would expect to use this data, which informs our ideas as to how to release it.

@Milimetric @Nuria: Ok. And by when do you need this input?

@leila we are working on this for the next couple of weeks, so the sooner the better.

@leila: we can of course iterate on the format in the future. Eventually we'll have a public API to query the whole dataset. But for now we just want some idea of common / high priority use cases that we can try to serve with a simpler release. Thank you so much for looking into it.

Rough draft of a blurb about why this dataset is useful:

NOTE: A history of activity on Wikimedia projects as complete and research-friendly as possible. We add context to edits, such as whether they were reverted, when they were reverted, how many bytes they changed, how many edits had the user made at that time, and much more, all in the same row as the edit itself. So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.

(updated note per @nettrom_WMF's suggestions below)
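
To make the "no joins needed" point concrete, here is a rough sketch (file name is hypothetical; column names such as revision_is_identity_reverted are taken from the wikitech schema pages and should be treated as illustrative until the dump format is final):

```python
# Rough sketch: share of reverted edits on one wiki from the denormalized history,
# computed from single rows with no table joins. File name is hypothetical.
import pandas as pd

edits = pd.read_csv("simplewiki.2019-06.tsv.bz2", sep="\t")

revisions = edits[(edits["event_entity"] == "revision") &
                  (edits["event_type"] == "create")]

# The flag may arrive as the strings "true"/"false" in a TSV; normalise to bool first.
reverted = revisions["revision_is_identity_reverted"].astype(str).str.lower() == "true"
print(f"Share of edits later reverted: {reverted.mean():.1%}")
```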

Change 528504 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add spark job to create mediawiki history dumps

https://gerrit.wikimedia.org/r/528504

+1 to @leila's rough draft. Could we add something to the last sentence to emphasize that these datasets remove the need for additional processing, to make the point about adding context stronger? E.g.:

So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.

One thing I was thinking about with regard to the file format of the dumps is that a lot of the Wikipedia research I've seen that studies multiple wikis selects them based on their number of articles. I suspect that largely correlates with the amount of activity, meaning that if we group wikis by number of articles rather than number of events, the list looks mostly the same.

One thing we'd definitely want is some place to look up where to find a given wiki, though, so those who want to use these datasets can easily figure out which files to get based on which wiki they're studying. A sketch of what that lookup could look like follows.
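
Purely as an illustration of that lookup idea (neither the index file nor its name exists yet; this is an assumption about what a release could ship alongside the dump files):

```python
# Hypothetical sketch of a wiki -> dump file lookup. Assumes the release ships an
# index file mapping each wiki_db to the file(s) that contain it; the index name
# and the grouping scheme are invented here just to illustrate the idea.
import json

with open("mediawiki_history_dumps_index.json") as f:    # hypothetical index file
    wiki_to_files = json.load(f)                          # e.g. {"enwiki": ["enwiki.2019-06.tsv.bz2"], ...}

def files_for_wiki(wiki_db: str) -> list:
    """Return the dump files a researcher needs for a given wiki."""
    return wiki_to_files.get(wiki_db, [])

print(files_for_wiki("dewiki"))
```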

Change 530002 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add Oozie job for mediawiki history dumps

https://gerrit.wikimedia.org/r/530002

See the final format of the dumps, chosen after the community survey, here: T224459#5491080

Change 528504 merged by jenkins-bot:
[analytics/refinery/source@master] Add spark job to create mediawiki history dumps

https://gerrit.wikimedia.org/r/528504

Change 530002 merged by Joal:
[analytics/refinery@master] Add Oozie job for mediawiki history dumps

https://gerrit.wikimedia.org/r/530002

Change 538312 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Rsync analytics mediawiki history dumps to dumps.wikimedia.org

https://gerrit.wikimedia.org/r/538312

How big are these dumps for one set, and how many sets do we intend to keep? Adding @Bstorm since the host behind dumps.wikimedia.org is a WMCS server.

@ArielGlenn thanks for chiming in!

How big are these dumps for one set, and how many sets do we intend to keep?

Each dump set consists of about 2000 files; the biggest of them is no larger than ~2 GB.
The total size of a dump set is around 440 GB, and we'd like to keep 4 to 6 of them if possible.
Do you see any issues with that?
Cheers!
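
As a quick back-of-the-envelope check on the retention math (figures taken from the comment above; nothing here is measured):

```python
# Worst-case disk estimate for retaining mediawiki history dump sets,
# using the ~440 GB per set and "4 to 6 sets" figures quoted above.
set_size_gb = 440
sets_kept = 6

total_tb = set_size_gb * sets_kept / 1024
print(f"Worst-case retention footprint: ~{total_tb:.1f} TB")   # ~2.6 TB
```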

That size seems fine for the current disk available.

Change 539151 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Add timer to delete old MWH dumps

https://gerrit.wikimedia.org/r/539151

Change 538312 merged by Ottomata:
[operations/puppet@production] Rsync analytics mediawiki history dumps to dumps.wikimedia.org

https://gerrit.wikimedia.org/r/538312

Change 539374 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] dumps::manifests::web::fetches::stats: correct path for mediawiki history

https://gerrit.wikimedia.org/r/539374

Change 539374 merged by Ottomata:
[operations/puppet@production] dumps::manifests::web::fetches::stats: correct path for mediawiki history

https://gerrit.wikimedia.org/r/539374

Change 540442 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] web::fetches::stats.pp Absent mediawiki history rsync

https://gerrit.wikimedia.org/r/540442

Change 540442 merged by Ottomata:
[operations/puppet@production] web::fetches::stats.pp Absent mediawiki history rsync

https://gerrit.wikimedia.org/r/540442

Nuria set the point value for this task to 8.

Change 539151 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge: Add timer to delete old MWH dumps

https://gerrit.wikimedia.org/r/539151

Change 559926 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Correct timer syntax

https://gerrit.wikimedia.org/r/559926

Change 559926 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge: Correct timer syntax

https://gerrit.wikimedia.org/r/559926