
Make HTML dumps available
Open, Normal, Public

Description

Make HTML dumps available, similar to the existing wikitext dumps of Wikipedia.

Why?

  • Templates: The HTML version has all templates expanded, while wikitext doesn't, and as far as we know there is no easy, standard method of expanding templates locally without installing a full MediaWiki stack.
  • Frequency: We (researchers inside and outside of WMF) often need access to this data.
  • Load: At the moment, we are all hitting the API to get this data.
  • Efficiency: Hitting the API for this data is inefficient: it takes many hours to fetch the full HTML of a project such as enwiki (see the sketch below).
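For context, "hitting the API" in practice means one request per article against the public REST endpoint /api/rest_v1/page/html/{title}. The snippet below is a minimal sketch of that per-page crawl; the page titles, user-agent string, and sleep interval are purely illustrative, not part of any existing tooling.

```python
# Minimal sketch: fetch rendered (template-expanded) HTML one article at a
# time via the Wikimedia REST API. A full-project dump would require millions
# of such requests, which is why a bulk HTML dump is being requested.
import time
import requests

API = "https://en.wikipedia.org/api/rest_v1/page/html/"
titles = ["Alan_Turing", "Ada_Lovelace"]  # illustrative titles only

for title in titles:
    resp = requests.get(API + title,
                        headers={"User-Agent": "research-crawler/0.1 (example)"})
    resp.raise_for_status()
    html = resp.text  # Parsoid-rendered HTML with templates already expanded
    # ... store html for later analysis ...
    time.sleep(0.1)  # illustrative politeness delay; a real crawl needs proper rate limiting
```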

Some recent applications
A variety of research needs this kind of data. To give you a sense, see these two recent publications for which we had to use the API to get the HTML:

  • Dimitrov, Dimitar, et al. "What Makes a Link Successful on Wikipedia?." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • Singer, Philipp, et al. "Why We Read Wikipedia." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

Event Timeline

leila created this task.Dec 7 2017, 8:17 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 7 2017, 8:17 PM
leila claimed this task.Dec 7 2017, 8:21 PM
leila updated the task description. (Show Details)

I've assigned this task to myself until the description of the task is completed. Once that's done, the task should be passed to Analytics, most likely. :)

leila triaged this task as Normal priority.Dec 7 2017, 8:22 PM
leila moved this task from Staged to In Progress on the Research board.

One reason having the HTML dump (as opposed to plain wikitext) would be very useful is that HTML has all templates expanded, while wikitext doesn't, and I don't know of any easy, standard method of expanding templates locally without installing a full MediaWiki stack.

Agreed, template expansion was also the main reason we used HTML from the API in our research. In general, when investigating viewing behavior on Wikipedia, the HTML is also simply closest to what readers actually see.

Two recent publications for which we acquired all article HTML for the English Wikipedia via the API:

  • Dimitrov, Dimitar, et al. "What Makes a Link Successful on Wikipedia?." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • Singer, Philipp, et al. "Why We Read Wikipedia." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

There is also a wide range of alternative parsers available (https://www.mediawiki.org/wiki/Alternative_parsers), but none of them seemed to be full-featured, easy to set up, and practical for use at large scale (at least not when our PhD student checked, which was about a year ago).

leila removed leila as the assignee of this task.Dec 8 2017, 6:26 PM
leila updated the task description. (Show Details)
leila updated the task description. (Show Details)
leila added a project: Analytics.
leila added a subscriber: bd808.Dec 8 2017, 6:39 PM

@bd808 I learned from Nuria that we should have a chat with you regarding dumps and this ticket. So, here I am. :) I'm bringing this up now as I know we're all planning for next quarter, and I don't know how much work this ticket is, etc. but I'd like for us to explore it together to see what can (and cannot) be done on this front. If you need more info from us (Research), just ping.

@leila the cloud-services-team (specifically @madhuvishy) is working on T168486: Migrate customer-facing Dumps endpoints to Cloud Services in concert with @ArielGlenn to migrate the public storage and delivery of Dumps (NFS, rsync, and HTTPS endpoints) to a new set of physical hosts that we will be responsible for operating. Generating Dumps is not currently in scope for our team. We are just going to help with the last mile delivery both to take that burden off of @ArielGlenn and to help ensure that our Cloud-Services users get the best access to Dumps data that we can support. Building and operating a system to generate HTML dumps is very much out of scope for the Cloud Services team.

fdans moved this task from Incoming to Radar on the Analytics board.Dec 11 2017, 4:51 PM
leila added a comment.Dec 11 2017, 5:36 PM

@bd808 thanks for the note. I'll take this offline to discuss who I should reach out to as it's a bit unclear now. :)

leila added a comment.Dec 11 2017, 6:00 PM

@ArielGlenn actually, I overlooked this. Are you the person in charge of this kind of request? :)

This is meant to be (I think) the dumps of content from RESTBase as stored by Parsoid. I know I can't get to this before at least February, and maybe later than that. I have a script that was mostly ready for my end of things; the RESTBase piece needed a repo for deployment with all the dependencies etc., so someone would need to take that part on. I'm not sure who would do that on which team any more :-)

This is meant to be (I think) the dumps of content from RESTBase as stored by Parsoid. I know I can't get to this before at least February, and maybe later than that.

If you can get to it by February, that'd be great. If you can't, it's good to know when it would be reasonable to expect you to have time for this. We can set expectations accordingly with the research community.

I have a script that was mostly ready for my end of things; the RESTBase piece needed a repo for deployment with all the dependencies etc., so someone would need to take that part on. I'm not sure who would do that on which team any more :-)

@DarTar do you know by any chance which team we should ask for help for this part?
@ArielGlenn who was traditionally responsible for this? Maybe we can start from there and try to figure out which team may be responsible for it now?

leila claimed this task.Jan 6 2018, 7:14 PM

@leila thanks for pushing this forward. I am supportive of this request, and I also believe it might deserve a broader discussion to understand data consumer needs (rings a bell, @Capt_Swing?).

Context: the main takeaway I got from talking to the DBpedia folks is that the #1 request from their data consumers – other than RDF – is plain text (with all HTML and markup stripped). There's also a longstanding request for dumps in JSON format. In other words, while it's clear that the dumps are primarily consumed by researchers nowadays, I don't have a good sense of what the respective priority should be regarding different productized formats. It sounds like HTML would meet a number of use cases.

Regarding ownership, I'd imagine this work would live at the intersection of Services and Analytics but AFAIK it's unplanned work so it deserves at the very least a detailed discussion within Tech Mgmt.

Nuria added a subscriber: Nuria.Jan 8 2018, 6:52 PM

@DarTar neither the Analytics nor the Cloud Services team will be working on this; see above. This work is on @ArielGlenn's backlog; he is the ops engineer who supports dumps.

Nuria added a subscriber: bmansurov.Jan 8 2018, 8:44 PM

Also, I think once the dumps infrastructure is migrated to cloud ("labs"), @bmansurov could probably team up with @ArielGlenn, and that might be the fastest way to get this done?

leila moved this task from In Progress to Staged on the Research board.Mar 15 2018, 9:00 PM

I think the migration to cloud is done now.

Sj added a subscriber: Sj.Nov 26 2018, 6:42 PM

See also: https://phabricator.wikimedia.org/T210413 for maintaining an IPFS-friendly dump (which could be generated from an unzipped HTML dump)

leila added a comment.Jan 14 2019, 1:11 PM

Cervisiarius brought this task up last week and mentioned that he and a master's student of his are working on: 1. putting together the system for generating HTML dumps; 2. releasing, as a one-time effort, HTML dumps of the most recent dumps by the time they're done. The latter can unblock/encourage quite a few researchers, while the former can be picked up and scaled by WMF when WMF is ready.

I'm moving this task to In Progress on the Research board based on this recent change. I will keep the task assigned to me since Cervisiarius doesn't regularly check Phabricator.

leila moved this task from Staged to In Progress on the Research board.Jan 14 2019, 1:11 PM

Is this the full HTML of the pages, or 'just' the parsed/rendered wikitext, or...? And is there a notion of what the code looks like or what components are involved, so we can tell whether it can be folded into our infrastructure? Note there is already draft code for parsed/rendered wikitext, as pulled from RESTBase.

In any case it would be great to have a one-off copy.

leila moved this task from In Progress to Staged on the Research board.Apr 1 2019, 11:37 PM

This is such an interesting ticket cluster. Subscribed!

Out of curiosity, what's your use case for them, @EvanProdromou ? Not that I can get back to them any time soon :-(

leila edited projects, added Research-Backlog; removed Research.Jul 11 2019, 3:47 PM