Release edit data lake data publicly
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | mforns | T208612 Release edit data lake data as a public json dump / mysql dump, other?
Resolved | | mforns | T224459 Recommend the best format to release public data lake as a dump
Event Timeline
We think the research community can benefit from the edit data lake data in the form of a somewhat large text dump that contains JSON.
This dump will contain edit data denormalized for easy analytics calculations; a rough sketch of what such a record could look like follows the links below:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_user_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_page_history
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Metrics
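To make "denormalized for easy analytics" concrete, here is a minimal sketch of what a single event record could look like and why no joins are needed to compute a simple metric. The field names are illustrative only, loosely following the schema pages linked above; see those pages for the authoritative columns.

```python
# Illustrative only: a hypothetical denormalized edit event.
# Field names loosely follow the mediawiki_history schema linked above;
# the wikitech pages have the authoritative column list.
example_event = {
    "wiki_db": "enwiki",
    "event_entity": "revision",            # revision / page / user
    "event_type": "create",
    "event_timestamp": "2019-08-01T12:34:56Z",
    "event_user_id": 12345,
    "event_user_text": "ExampleUser",
    "page_id": 6789,
    "page_title": "Example_page",
    "page_namespace": 0,
    "revision_id": 987654321,
    "revision_text_bytes": 2048,
}

# Because user and page attributes are already joined onto each event,
# a metric like "edits per namespace" needs no further joins:
edits_by_namespace = {}
ns = example_event["page_namespace"]
edits_by_namespace[ns] = edits_by_namespace.get(ns, 0) + 1
```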
Adding some subscribers who can provide input on whether this is a good idea or not.
@leila / @nettrom_WMF: fyi I'm working on this now. I've started a draft page where I'm thinking out loud about how to publish: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Public
Any thoughts / use cases / wishes are welcome.
@Milimetric nice to see that we're here. :) I did one pass over the wikitech page. What kind of input is most helpful for you?
@leila Best would be use cases and how you would expect to use this data, which informs our ideas as to how to release it.
@leila: we can of course iterate on the format in the future. Eventually we'll have a public API to query the whole dataset. But for now we just want some idea of common / high priority use cases that we can try to serve with a simpler release. Thank you so much for looking into it.
Rough draft of a blurb about why this dataset is useful:
(updated note per @nettrom_WMF's suggestions below)
Change 528504 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery/source@master] [WIP] Add spark job to create mediawiki history dumps
+1 to @leila's rough draft. Could we add something to the last sentence to emphasize that these datasets remove the need for additional processing, to make the point about adding context stronger? E.g.:
So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.
One thing I was thinking about with regards to the file format of the dumps is that a lot of the Wikipedia research I've seen that studies multiple wikis selects them based on their number of articles. I suspect that largely correlates with the amount of activity, meaning that if we group wikis by number of articles rather than number of events, the list looks mostly the same.
One thing we'd definitely want, though, is some place to look up where to find a given wiki, so those who want to use these datasets can easily figure out which files to get based on which wiki they're studying.
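One way to serve that lookup, sketched below under the assumption that each dump file carries the wiki database name in its file name (the naming scheme here is hypothetical, not the one actually published): build a small index mapping each wiki to its files.

```python
import os
from collections import defaultdict

def index_dump_files_by_wiki(dump_dir):
    """Group dump files by wiki, assuming a hypothetical
    '<snapshot>.<wiki_db>.<time_range>.tsv.bz2' naming scheme."""
    files_by_wiki = defaultdict(list)
    for name in os.listdir(dump_dir):
        parts = name.split(".")
        if name.endswith(".bz2") and len(parts) >= 4:
            wiki_db = parts[1]
            files_by_wiki[wiki_db].append(os.path.join(dump_dir, name))
    return files_by_wiki

# e.g. index_dump_files_by_wiki("dumps/2019-09")["enwiki"] would list every
# file a researcher studying English Wikipedia needs to download.
```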
Change 530002 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add Oozie job for mediawiki history dumps
See the final format of the dumps, chosen after the community survey, here: T224459#5491080
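For readers who just want to poke at one of the published files, here is a minimal sketch of streaming events out of a dump, assuming the bzip2-compressed, tab-separated format chosen there; the file name and column list in the usage comment are placeholders, not the published schema.

```python
import bz2

def iter_events(path, columns):
    """Stream rows from a bzip2-compressed, tab-separated dump file.
    `columns` is a placeholder for the published column order, which is
    documented in T224459 and on the wikitech schema pages."""
    with bz2.open(path, mode="rt") as f:
        for line in f:
            yield dict(zip(columns, line.rstrip("\n").split("\t")))

# Hypothetical usage (file name and columns are illustrative only):
# for event in iter_events("2019-09.simplewiki.all-time.tsv.bz2",
#                          columns=["wiki_db", "event_entity", "event_type"]):
#     print(event["event_type"])
```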
Change 528504 merged by jenkins-bot:
[analytics/refinery/source@master] Add spark job to create mediawiki history dumps
Change 530002 merged by Joal:
[analytics/refinery@master] Add Oozie job for mediawiki history dumps
Change 538312 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Rsync analytics mediawiki history dumps to dumps.wikimedia.org
How big are these dumps for one set, and how many sets do we intend to keep? Adding @Bstorm since the host behind dumps.wikimedia.org is a WMCS server.
@ArielGlenn thanks for chiming in!
How big are these dumps for one set, and how many sets do we intend to keep?
Each dump set consists of about 2000 files; the biggest of them is no larger than ~2 GB.
The total size of a dump set is around 440 GB, and we'd like to keep 4 to 6 of them if possible.
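At roughly 440 GB per set, keeping 4 to 6 sets works out to about 1.8 to 2.6 TB in total on the dumps host.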
Do you see any issues with that?
Cheers!
Change 539151 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Add timer to delete old MWH dumps
Change 538312 merged by Ottomata:
[operations/puppet@production] Rsync analytics mediawiki history dumps to dumps.wikimedia.org
Change 539374 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] dumps::manifests::web::fetches::stats: correct path for mediawiki history
Change 539374 merged by Ottomata:
[operations/puppet@production] dumps::manifests::web::fetches::stats: correct path for mediawiki history
Change 540442 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] web::fetches::stats.pp Absent mediawiki history rsync
Change 540442 merged by Ottomata:
[operations/puppet@production] web::fetches::stats.pp Absent mediawiki history rsync
Change 539151 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge: Add timer to delete old MWH dumps
Change 559926 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] analytics::refinery::job::data_purge: Correct timer syntax
Change 559926 merged by Ottomata:
[operations/puppet@production] analytics::refinery::job::data_purge: Correct timer syntax