
Recommend the best format to release public data lake as a dump
Closed, ResolvedPublic


At the end of this study, recommend the best way to split the data lake edits data and/or a set of practical requirements.

What are the possible constraints on the partitions?
We will be releasing the whole MediaWiki dataset with user, revision, page infos.

Wikigroup- How to group the different wikis? One file per wiki? One file per large wikis, and grouping the smaller ones? If grouping, what grouping makes sense?

Time- How to split up wikis over time? For large wikis such as enwiki, we may have to do this, for smaller ones, we have a choice. If so, what splitting makes the most sense?

Use-cases- Are there common use-cases for this data release that we should be aware of, so that we can tune the release to them?

High-level input is good, too- If you don't know the answer to the question of which wikigroup and over what time period, that is fine. Tell us practical constraints you have: An n GB file is just too big; ...

Q. Why don't you release the data in one file?
A. It's too big and it won't be useful for the majority of the use-cases where only a portion of the data is needed. (Think about those on low bandwidths or with limited processing power.)

Q. Why don't you release a sample of the data for folks to play with and only after you see the actual use-cases, decide the final set of requirements?
A. We may end up doing this, but there are two constraints here: We ideally want to have the data out before the end of September 2019. If we have it out and we don't finalize the format, people will start writing code based on it and if we have to change things, it will mess up their work. We will have to do some of this in the future, but we don't want to rely on iteratively improving the data release format for things we could catch before the release.

Per Capt_Swing's input, we most likely don't need a methodology here. Given the nature of the feedback requested, it is enough to send a simple questionnaire to outside researchers and gather input from internal researchers and make a recommendation/decision.

The following points have been raised by others and are important to keep in mind:

  • Whether we want the data lake to be the go-to place for the majority of research use-cases, or want it to complement the MySQL DBs, will have an impact on the type of input we gather. (diego) In conversations with Analytics, the long-term goal is identified as making the data lake the place where most research data needs are satisfied, but that is much longer term; in the short term the two resources may act as complements.
  • What happens to the computational resources researchers need in order to process data from the data lake? (diego) The MySQL DBs come hand-in-hand with resources. This can make the work of researchers in under-resourced environments easier and can make the data sets in the MySQL DBs more attractive to them in this sense. Longer-term plans for the Data Lake are encouraged to consider the computational resource needs of a volunteer community of researchers processing the data.

Provided by Milimetric and can be used in communications with external researchers/end-users:

A history of activity on Wikimedia projects as complete and research-friendly as possible. We add context to edits, such as whether they were reverted, when they were reverted, how many bytes they changed, how many edits had the user made at that time, and much more, all in the same row as the edit itself. So you can focus more on what you want to find out instead of joining a bunch of tables.

Event Timeline

fdans moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
leila renamed this task from Evaluate best format to release public data lake as a dump to Recommend the best format to release public data lake as a dump.Aug 6 2019, 1:32 AM
leila updated the task description.

Dan, Marcel, and I met on 2019-08-02 and discussed this task. I updated the task description based on that conversation. Dan and Marcel: if you see something missing, please correct it.

Next steps:
I'll talk with Jonathan to see if we can work on this during Q1. That would be the ideal case, as we could avoid the need for major changes in the future if we can do thorough user research and provide recommendations to you based on it. Diego is also reviewing the task to see if he can provide immediate feedback. I hope that by the end of this week we have an answer about the next steps for you.

Hey folks, I thought it might be useful if I shared some thoughts here.

When I first started analyzing Wikipedia edit histories, the first thing I did was figure out a strategy for getting the data out of the giant XML files to a place where I could use it. XML was hard to process at the time, and building simple tables (read: CSV files, sometimes loaded into MySQL and PostgreSQL) made the work of processing the data much more straightforward. I often found that page-based sorting was advantageous. When I wanted to ask questions about reverts or word persistence, I needed edits grouped by page and sorted, so that I could look for important patterns that happened to a page. In other studies, I wanted to know what users did and what happened to them. I found that I then needed to re-sort the data based on user info like user_id or IP address. This allowed me to look at user sessions, and the effects of being reverted on user activity.

I'd often load two separate tables into PostgreSQL and then "partition" them based on each sort criterion. This actually puts the raw data on the disk in sorted order so that you can optimize certain query patterns, e.g. revision_by_page and revision_by_user. That would let me perform really fast queries that aggregated by page or user, but it had a lot of up-front cost.
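The two orderings described above can be sketched in a few lines of Python, using an in-memory list of edits instead of PostgreSQL tables. The field names (page_id, user_id, timestamp) are illustrative, not the actual schema:

```python
# Illustrative edit records; in practice these come from the dump files.
edits = [
    {"page_id": 2, "user_id": "A", "timestamp": "2019-01-03"},
    {"page_id": 1, "user_id": "B", "timestamp": "2019-01-01"},
    {"page_id": 1, "user_id": "A", "timestamp": "2019-01-02"},
]

# revision_by_page: group by page, then order by time within each page,
# so per-page patterns (reverts, word persistence) are contiguous.
revision_by_page = sorted(edits, key=lambda e: (e["page_id"], e["timestamp"]))

# revision_by_user: group by user, then order by time within each user,
# so user sessions and reactions to reverts are contiguous.
revision_by_user = sorted(edits, key=lambda e: (e["user_id"], e["timestamp"]))
```

The same data is stored twice, each copy physically ordered for one query pattern; that duplication is exactly the up-front cost mentioned above.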

In summary: There is no one right way to sort this data. At the very least, there are two good ways: group by user and sort vs. group by page and sort. Sorting by the timestamp of the edit is useful too, but will likely require re-sorting/aggregation for any interesting analysis.

This has implications for how the files are broken apart. E.g. if we produce a new data file for enwiki each month, does it still make sense to sort by page or user or raw timestamp? I'm not sure. I guess it depends on how I might be able to merge new data with old data. I.e., can I take page-grouped, timestamp-sorted edits and quickly merge-sort them with a new data file? If so, then it doesn't much matter how these data files are sorted, as data consumers will be able to re-process the data quickly and easily as necessary.
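The merge question above has a cheap answer when both files share a sort order: a streaming k-way merge combines them in one linear pass, with no full re-sort. A minimal sketch with illustrative timestamp values:

```python
import heapq

# Two already-sorted sequences: the existing history and a new monthly file.
# Values are illustrative timestamps; real rows would sort on a key column.
old_data = ["2019-01-01", "2019-02-15", "2019-06-30"]
new_month = ["2019-07-02", "2019-07-19"]

# heapq.merge streams both inputs and yields a globally sorted sequence
# in O(n) comparisons, without loading or re-sorting everything.
merged = list(heapq.merge(old_data, new_month))
```

With real dump files, the same pattern works line-by-line over file iterators, so even very large histories can be merged with constant memory.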

So how will data users end up processing this data? Well, they will probably use R and Python judging from the state of data science practice today. They may use traditional RDBMS like MySQL and PostgreSQL. People with access to good resources and operations support may use Hadoop/Spark/etc. but they'll likely be the same people who can already get most of the value they need out of XML dumps. A lot of social scientist-type researchers will probably be using simple Python scripts to sample the data small enough to perform in-memory analysis in a notebook.

@Milimetric @mforns I created a survey and added you both as collaborators. You can make changes in it directly. One place that I need help with is in the introduction to the survey. We need to make it much clearer and exciting there for people to know what data we're talking about. :)

@Halfak thanks for the input. We'll make sure we have it as part of our thinking.

@leila: I looked over the survey and made some edits to the first part. The questions look great to me. Thanks again for putting this together.

@Milimetric thanks! I added two questions at the end: field of research and email address. I changed the deadline to 2019-09-03 and started advertising for it now.

Update: We have 3 responses in the survey and one over email. I sent a reminder to wiki-research-l and an email to analytics. The deadline is in a week from now.

@Milimetric @mforns here is your update:

  • We have 6 responses to the survey. I know of one more which may come in in the next 14 hours. I'll close the survey after that.
  • I think it's best if you review the responses and then let me know if you need my help anywhere. Let me know if this works for you.
  • I have received some notes over email which may apply to data releases beyond this specific one:
    • one page that provides a simple summary of the dumps, i.e. one that tells you where you can get the data that you would want to get, and also contains a link to the dumped datasets' schemas
    • somewhat more intuitive naming patterns (for an outsider who is not familiar with the Wikipedia lingo), or (equally good) a short description of the key words (e.g. what's a namespace, template, article; what are the relationships between them; etc.), accessible on the same page that contains the summary of the dumps
    • an easy way to unequivocally refer to each entity and trace its relation to all data that use Wikimedia dumps (e.g. Wikipedia, Wikidata, links, revisions, comments and so on), ideally by an artificial numerical unique ID

Thanks @leila!

I reviewed the survey responses yesterday, both on survey results and comments in the Phabricator task(s).
There were good insights!

I wanted to write a summary in here, but was waiting to see if we received more responses today, as you mentioned.
So I'll wait for that, then write a final summary and include your notes from the last comment.

@mforns great. (I just checked and all answers that I expected are in.) I'll assign this task to you for now, let me know if I can help somewhere.

Hi all!

Here's a summary of the survey responses. I added the conclusions that we Analytics have drawn, inline. Please, feel free to comment on them. In a later post to this task I'm going to present the final structure/format of the dumps in detail.

How to split up the Wikimedia projects?

Most respondents agree that splitting by wiki (project) is the best solution. It allows users to study specific wikis (or groups of wikis) without having to download/filter/process extra data. Usually one would not be interested in mixing different types of wikis, since they all behave differently. Also, 50% of the respondents mention that they would not mind having to download long lists of small files, provided there's a clear mapping or convention that tells them where to download from, manually or programmatically. Only one respondent does not endorse the split by wiki (project), but does not argue against it either.

We decided to split by individual wiki (not by wiki group), and provide a clear mapping/convention for users to know where to download files from.

How to split up the files over time?

Half of the respondents mention a monthly split as desirable, although I read the responses as having no strong opinion overall on the exact interval of the split: others mention daily, weekly, quarterly and yearly splits as valid. A couple of responses repeat the argument of not minding lots of smaller files, provided there's a clear structure/mapping that makes the downloads easy.

We decided to split the bigger wikis (wikidatawiki, commonswiki and enwiki) by month, and split medium sized wikis by year (see list of medium-sized wikis on my next post on this task). As for smaller wikis that can fit in a single file, we would not split them at all. It should be clear in the mapping/convention which wikis are split by which time-interval.

What other constraints on your end do you want us to consider when we are making a decision about how to slice the data?

Here, several interesting points were raised:

  • Two respondents asked for us to add the corresponding wikidata item ID to all records of the dump.
  • One person asked that the compression format be supported out of the box by the major OSs.
  • And one respondent mentioned the value of having the data sorted (grouped) by user ID and by page ID.

We will add the wikidata item ID to the data set, see: T221890. The compression format for the dumps will be Bzip2, a free software program available in Linux and macOS distributions, with a free Windows version as well. One of Bzip2's advantages is that it allows concatenating downloaded files and treating them as one single compressed file. Bzip2 also has a very high compression rate, thus reducing the size of the compressed files. As for sorting the data by user or page, we understand its value and are willing to implement it in the future. However, there are some technical challenges involved, so for the first version of these dumps we'll generate data sorted/split by timestamp only.

What kind of research or analysis you would like to do with this data?

In more than 50% of the responses, I could identify fields or associations already present in the mediawiki history data set that would aid in the research/analysis described (cool!). One response mentions they would use these dumps in conjunction with other data sources, which can definitely be done. Some responses are very generic, and it is difficult to determine if the format of the dumps would help them, but I could not see any sign of missing information in those responses either. Someone suggested adding Parsoid and ORES data to the data set, to save having to make API calls. And someone mentioned sorting the data by user (the same request as in a previous response, by a different person).

It seems for now that the contents and format of the dumps should fulfill a good part of the respondents' needs. We'll continue to be open to suggestions so that we can improve the schema and format of this dump, once people start using it. I created a Phabricator task (T232843) for the suggestion to add Parsoid and ORES data to this data set.
I also created another task for releasing the dumps sorted by user and page (T232844).

What is your field of research or development?

Responses for this question were interesting, but I think they do not affect the format of the dumps.

So, this is the final format of the MediaWiki history dumps:


The job that generates the dumps will execute once a month, together with the release of each new hive::wmf.mediawiki_history snapshot. Each monthly dump will consist of an updated version of the full MediaWiki history. Note that this does not allow for incremental downloading of the data, as some survey respondents requested. This limitation is not ideal, but there are weighty technical reasons that prevent us, for now, from making this data incremental.


The dumps output is versioned by snapshot. This means that each month a new folder named after the snapshot (YYYY-MM) will be created with the new updated version of the dumps. Older snapshots will be kept for a couple of months and then deleted for storage reasons. We still have not decided how many snapshots we'll keep; we can discuss this, but it will probably be between 3 and 6.


The data will be partitioned by 2 dimensions: wiki and time. This way users can download data for a wiki (or set of wikis) of their choice. The time split is necessary because of file size reasons. There are 3 different time partitions: monthly, yearly and all-time. Very big wikis will be partitioned monthly, while medium wikis will be partitioned yearly, and small wikis will be dumped in one single file. This way we ensure that files are not larger than ~2GB, and at the same time we prevent generating a very large number of files.

Wikis partitioned monthly: wikidatawiki, commonswiki, enwiki.
Wikis partitioned yearly:
dewiki, frwiki, eswiki, itwiki, ruwiki, jawiki, viwiki, zhwiki, ptwiki, enwiktionary, plwiki, nlwiki, svwiki, metawiki, arwiki, shwiki, cebwiki, mgwiktionary, fawiki, frwiktionary, ukwiki, hewiki, kowiki, srwiki, trwiki, loginwiki, huwiki, cawiki, nowiki, mediawikiwiki, fiwiki, cswiki, idwiki, rowiki, enwikisource, frwikisource, ruwiktionary, dawiki, bgwiki, incubatorwiki, enwikinews, specieswiki, thwiki.
Wikis in one single file: all the others.

File format

The chosen output file format is TSV, to reduce the size of the dumps as much as possible. Unlike e.g. JSON, the TSV format carries no per-record metadata, so even after compression it is lighter. Also, mediawiki_history data is pretty flat: the only nested fields are arrays of strings, which can be encoded in TSV. The encoding of arrays is the following: array(<value1>,<value2>,...,<valueN>).

The chosen compression algorithm is Bzip2, because it's a widely used free software format and has a high compression rate which makes the dump files smaller. Also, one can concatenate several Bzip2 files and treat them as a single Bzip2 file (in case users need to do that).
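The concatenation property is easy to demonstrate with Python's bz2 module: two independently compressed chunks, concatenated byte-for-byte, decompress as one combined stream. The file contents here are illustrative:

```python
import bz2

# Compress two chunks separately, as if they were two downloaded
# dump files (contents are illustrative, not real dump rows).
part1 = bz2.compress(b"enwiki\t2019-08\n")
part2 = bz2.compress(b"enwiki\t2019-09\n")

# Concatenating the raw .bz2 bytes yields a valid multi-stream file;
# bz2.decompress handles all streams and returns the joined payload.
combined = bz2.decompress(part1 + part2)
```

On the command line the equivalent is simply `cat part1.bz2 part2.bz2 > all.bz2`, which `bunzip2` decompresses as one file.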

Directory tree

The base path in HDFS will be: /wmf/data/archive/mediawiki/history/dumps/
The folder structure will be: <snapshot>/<wiki>/<time_range>.tsv.bz2
Where <snapshot> is YYYY-MM, e.g. 2019-08; <wiki> is the wiki_db, e.g. enwiki or commonswiki; and <time_range> is either YYYY-MM for big wikis, YYYY for medium wikis, or all-time for the rest.
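Putting the partitioning and naming convention together, a download path can be built mechanically. A sketch (the yearly list is abridged; the full list is in the comment above):

```python
# Wikis partitioned monthly and yearly, per the lists above
# (YEARLY is abridged here for brevity).
MONTHLY = {"wikidatawiki", "commonswiki", "enwiki"}
YEARLY = {"dewiki", "frwiki", "eswiki", "itwiki", "ruwiki"}

def dump_path(snapshot, wiki, year=None, month=None):
    """Build <snapshot>/<wiki>/<time_range>.tsv.bz2 following the
    convention: monthly for big wikis, yearly for medium wikis,
    all-time for everything else."""
    if wiki in MONTHLY:
        time_range = f"{year}-{month:02d}"
    elif wiki in YEARLY:
        time_range = str(year)
    else:
        time_range = "all-time"
    return f"{snapshot}/{wiki}/{time_range}.tsv.bz2"
```

This is the kind of clear programmatic mapping the survey respondents asked for: given a wiki and a time range, the file location is fully determined.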



Can we move the info to wikitech so we can access it easily when this ticket is closed?

this looks great -- thanks @mforns for writing this up so clearly!