At the end of this study, we will recommend the best way to split the Data Lake edits data, and/or a set of practical requirements for the split.
What are the possible constraints on the partitions?
We will be releasing the whole MediaWiki dataset, with user, revision, and page information.
Wikigroup: How should we group the different wikis? One file per wiki? One file per large wiki, with the smaller ones grouped together? If we group, what grouping makes sense? (See the sketch after this list.)
Time: How should we split wikis over time? For large wikis such as enwiki we may have to; for smaller ones, we have a choice. If we split, what time-splitting makes the most sense?
Use-cases: Are there common use-cases for this data release that we should be aware of, so that we can tune the release to them?
High-level input is good, too: if you don't know the answer to the question of which wikigroup or what time period, that is fine. Tell us the practical constraints you have, e.g., an n GB file is just too big; ...
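To make the wikigroup and time dimensions concrete, here is a minimal sketch of two candidate file layouts. The wiki names are real, but the sizes, grouping thresholds, time buckets, and file-naming scheme are illustrative assumptions, not a proposal:

```python
# Two candidate partition layouts (illustrative only).
# Sizes are hypothetical placeholders, not real dataset sizes.
wikis = {"enwiki": 500, "dewiki": 80, "fawiki": 10, "vowiki": 0.1}  # GB, assumed

# Layout A: one file per wiki; large wikis further split by year.
def layout_per_wiki(wikis, split_threshold_gb=50, years=range(2001, 2020)):
    files = []
    for wiki, size_gb in wikis.items():
        if size_gb > split_threshold_gb:
            files += [f"{wiki}.{year}.tsv.gz" for year in years]
        else:
            files.append(f"{wiki}.all-time.tsv.gz")
    return files

# Layout B: large wikis get their own file; small wikis share one grouped file.
def layout_grouped(wikis, group_threshold_gb=1):
    files = [f"{w}.tsv.gz" for w, s in wikis.items() if s >= group_threshold_gb]
    if any(s < group_threshold_gb for s in wikis.values()):
        files.append("small-wikis.tsv.gz")
    return files

print(layout_per_wiki(wikis))
print(layout_grouped(wikis))
```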
Q. Why don't you release the data in one file?
A. It's too big, and it won't be useful for the majority of use-cases, where only a portion of the data is needed. (Think of users on low bandwidth or with limited processing power.)
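As a rough back-of-the-envelope illustration of the bandwidth point (the file sizes here are assumed, not actual dataset sizes):

```python
# Rough download-time estimates at a modest connection speed.
# File sizes are hypothetical; the point is the order of magnitude.
def download_hours(size_gb, bandwidth_mbps):
    # GB -> megabits, then seconds -> hours
    return size_gb * 8 * 1000 / bandwidth_mbps / 3600

for label, size_gb in [("single monolithic file", 500), ("one mid-size wiki", 2)]:
    print(f"{label}: ~{download_hours(size_gb, bandwidth_mbps=10):.1f} h at 10 Mbps")
# single monolithic file: ~111.1 h at 10 Mbps
# one mid-size wiki: ~0.4 h at 10 Mbps
```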
Q. Why don't you release a sample of the data for folks to play with, and decide the final set of requirements only after you see the actual use-cases?
A. We may end up doing this, but there are two constraints here: we ideally want to have the data out before the end of September 2019, and if we release it without finalizing the format, people will start writing code based on it, and if we then have to change things, it will break their work. We will have to do some of this iteration in the future, but we don't want to rely on iteratively improving the release format for things we could catch before the release.
Per Capt_Swing's input, we most likely don't need a formal methodology here. Given the nature of the feedback requested, it is enough to send a simple questionnaire to outside researchers, gather input from internal researchers, and make a recommendation/decision.
The following points have been raised by others and are important to keep in mind:
- Whether we want the Data Lake to be the go-to place for the majority of research use-cases, or to complement the MySQL DBs, will have an impact on the type of input we gather. (diego) In conversations with Analytics, the long-term goal was identified as having the Data Lake be the place where most research data needs are satisfied, but this is much longer term; in the short term the two resources may act as complements.
- What happens to the computational resources that researchers need in order to process data from the Data Lake? (diego) The MySQL DBs come hand-in-hand with computing resources. This can make the work of researchers in under-resourced environments easier, and in this sense can make the datasets in the MySQL DBs more attractive to them. Longer-term plans for the Data Lake should consider the computational resource needs of a volunteer community of researchers processing the data.
The following description was provided by Milimetric and can be used in communications with external researchers/end-users:
A history of activity on Wikimedia projects that is as complete and research-friendly as possible. We add context to edits, such as whether they were reverted, when they were reverted, how many bytes they changed, how many edits the user had made at that time, and much more, all in the same row as the edit itself. So you can focus on what you want to find out instead of joining a bunch of tables.
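As an illustration of what "all in the same row" buys a researcher, here is a minimal sketch of a no-join analysis. The file name and column names (user_edit_count_at_event, revision_is_reverted) are hypothetical placeholders, not the release schema:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only;
# the actual release defines the real schema.
edits = pd.read_csv("enwiki_history_sample.tsv", sep="\t")

# No joins needed: revert status and the editor's running edit count
# already sit on the same row as the edit itself.
newcomer_edits = edits[edits["user_edit_count_at_event"] < 10]
revert_rate = newcomer_edits["revision_is_reverted"].mean()
print(f"Share of newcomer edits that were reverted: {revert_rate:.1%}")
```

With the traditional normalized tables, the same question would require joining revision, user, and revert-tracking data before any analysis could start.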