Release edit data lake data publicly
|Status|Assignee|Task|
|Open|mforns|T208612 Release edit data lake data as a public json dump / mysql dump / other?|
|Open|mforns|T224459 Recommend the best format to release public data lake as a dump|
We think the research community can benefit from the edit data lake data in the form of a fairly large text dump containing JSON.
This dump will contain edit data, denormalized for easy analytics calculations.
Adding some subscribers who can provide input on whether this is a good idea.
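For illustration, consuming such a dump might look something like the sketch below, assuming newline-delimited JSON and gzip compression. The file name and field names are placeholders, since picking the actual format is exactly what T224459 is about:

```python
import gzip
import json

# Hypothetical file name and fields; the real naming scheme and schema
# are undecided (see T224459).
with gzip.open("mediawiki_history.enwiki.json.gz", "rt") as f:
    for line in f:
        event = json.loads(line)
        # Denormalized: user and page context travel with each revision row,
        # so no joins against separate user/page tables are needed.
        if event.get("event_entity") == "revision":
            print(event.get("event_timestamp"), event.get("event_user_text"))
```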
@leila / @nettrom_WMF: fyi I'm working on this now. I've started a draft page where I'm thinking out loud about how to publish: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Public
Any thoughts / use cases / wishes are welcome.
Rough draft of a blurb about why this dataset is useful:
(updated note per @nettrom_WMF's suggestions below)
+1 to @leila's rough draft. Could we add something to the last sentence to emphasize that these datasets remove the need for additional processing, to make the point about added context stronger? E.g.:
So you can focus more on what you want to find out instead of spending your time joining tables, writing code, and pre-processing large amounts of data.
One thing I was thinking about with regard to the file format of the dumps: a lot of the Wikipedia research I've seen that studies multiple wikis selects them based on their number of articles. I suspect that largely correlates with the amount of activity, meaning that if we group wikis by number of articles rather than by number of events, the list looks mostly the same.
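If we had a per-wiki stats table, a quick sanity check of that intuition might look like this sketch (the file name and column names are made up for the example):

```python
import pandas as pd

# Assumed columns: wiki, articles, events; this per-wiki stats table is
# hypothetical, not an artifact that exists in the data lake today.
stats = pd.read_csv("wiki_stats.csv")

by_articles = stats.sort_values("articles", ascending=False)["wiki"].tolist()
by_events = stats.sort_values("events", ascending=False)["wiki"].tolist()

# If article count tracks activity, the top-N sets should mostly coincide.
n = 50
overlap = len(set(by_articles[:n]) & set(by_events[:n])) / n
print(f"Top-{n} overlap between the two orderings: {overlap:.0%}")
```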
One thing we'd definitely want, though, is some place to look up a given wiki, so that those who want to use these datasets can easily figure out which files to get based on the wiki they're studying.
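A small machine-readable index could serve that purpose. A minimal sketch, with a purely hypothetical file name and schema mapping each wiki's database name to its dump file group:

```python
import json

# Hypothetical index: {"enwiki": "large", "nvwiki": "small", ...};
# nothing like this file exists yet.
with open("dumps_index.json") as f:
    index = json.load(f)

wiki = "itwiki"
print(f"{wiki} is in the '{index[wiki]}' file group")
```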