Make HTML dumps of Wikipedia available, similar to the existing wikitext dumps.
Why?
- Templates: The HTML version has all templates expanded, while wikitext does not. As far as we know, there is no easy, standard way to expand templates locally without installing a full MediaWiki stack.
- Frequency: We (researchers inside and outside of WMF) often need access to this data.
- Load: At the moment, we all hit the API to get this data.
- Efficiency: Fetching the data through the API is inefficient. It takes many hours to get the full HTML of a project such as enwiki.
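To make the load and efficiency points concrete, here is a minimal sketch of what per-page fetching can look like. It assumes the Wikimedia REST API page-HTML endpoint (`/api/rest_v1/page/html/{title}`), which this request does not name; the function names, the User-Agent string, and the request rates are hypothetical and only for illustration.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: the Wikimedia REST API page-HTML endpoint (not named in this request).
REST_HTML_URL = "https://{domain}/api/rest_v1/page/html/{title}"


def page_html_url(title, domain="en.wikipedia.org"):
    """Build the URL for one page's rendered HTML."""
    # Spaces become underscores; everything else, including "/", is percent-encoded.
    encoded = quote(title.replace(" ", "_"), safe="")
    return REST_HTML_URL.format(domain=domain, title=encoded)


def fetch_page_html(title, domain="en.wikipedia.org",
                    user_agent="html-dump-example/0.1 (hypothetical contact)"):
    """Fetch the rendered HTML of a single page: one HTTP request per page."""
    request = Request(page_html_url(title, domain),
                      headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def estimated_hours(page_count, requests_per_second):
    """Rough wall-clock hours to fetch page_count pages at a steady request rate."""
    return page_count / requests_per_second / 3600
```

Because every page costs one request, the total time grows linearly with the number of pages. For a hypothetical project with 5,000,000 pages fetched at a hypothetical 50 requests per second, `estimated_hours(5_000_000, 50)` comes to roughly 28 hours, and every researcher who needs the data repeats that load on the API.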
Some recent applications
A variety of research needs this kind of data. For example, in these two recent publications we had to use the API to obtain the HTML:
- Dimitrov, Dimitar, et al. "What Makes a Link Successful on Wikipedia?" Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
- Singer, Philipp, et al. "Why We Read Wikipedia." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.