
Make HTML dumps available
Open, Medium, Public

Description

Make HTML dumps of Wikipedia available, similar to the existing wikitext dumps.

Why?

  • Templates: The HTML version has all templates expanded, while wikitext doesn't, and as far as we know there is no easy, standard method of expanding templates locally without installing a full MediaWiki stack.
  • Frequency: We (researchers inside and outside of WMF) often need access to this data.
  • Load: At the moment, we are all hitting the API to get this data.
  • Efficiency: Hitting the API to get the data is inefficient: it takes many hours to fetch the full HTML of a project such as enwiki (see the sketch below).
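For illustration, here is a minimal sketch (not part of any existing pipeline) of the per-article API fetching described above, using the standard action=parse module; the article list, rate limiting, and error handling are omitted:

```python
# Minimal sketch of the current workaround: fetching rendered HTML one article
# at a time from the MediaWiki action API. Repeating this for every article of
# a project such as enwiki is what takes many hours.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_rendered_html(title):
    """Return the rendered HTML (templates expanded) of one article."""
    resp = requests.get(API, params={
        "action": "parse",
        "page": title,
        "prop": "text",
        "format": "json",
    }, headers={"User-Agent": "html-dump-example/0.1"})
    resp.raise_for_status()
    return resp.json()["parse"]["text"]["*"]

if __name__ == "__main__":
    print(fetch_rendered_html("Barack Obama")[:200])
```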

Some recent applications
A variety of research needs this kind of data. To give you a sense, see these two recent publications for which we had to use the API to get the HTML:

  • Dimitrov, Dimitar, et al. "What Makes a Link Successful on Wikipedia?." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • Singer, Philipp, et al. "Why We Read Wikipedia." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

Event Timeline

leila created this task.Dec 7 2017, 8:17 PM
Restricted Application added a subscriber: Aklapper. Dec 7 2017, 8:17 PM
leila claimed this task.Dec 7 2017, 8:21 PM
leila updated the task description. (Show Details)

I've assigned this task to myself until the description of the task is completed. Once that's done, the task should be passed to Analytics, most likely. :)

leila triaged this task as Medium priority.Dec 7 2017, 8:22 PM
leila moved this task from Staged to In Progress on the Research board.

One reason why having the HTML dump (as opposed to plain wikitext) would be very useful is that HTML has all templates expanded, while wikitext doesn't, and I don't know of any easy and standard method of expanding templates locally without having to install a full MediaWiki stack.

Agreed, template expansion was the main reason we also used HTML from the API in our research. In general, when investigating viewing behavior on Wikipedia, looking at the HTML is also just closest to what readers actually see.

Two recent publications where we acquired all article HTML for the English Wikipedia via the API:

  • Dimitrov, Dimitar, et al. "What Makes a Link Successful on Wikipedia?." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • Singer, Philipp, et al. "Why We Read Wikipedia." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

There is also a wide range of alternative parsers available (https://www.mediawiki.org/wiki/Alternative_parsers), but none of them seemed full-featured, easy to set up, and practical for use at large scale (at least not when our PhD student checked, which was about a year ago).

leila removed leila as the assignee of this task.Dec 8 2017, 6:26 PM
leila updated the task description. (Show Details)
leila updated the task description. (Show Details)
leila added a project: Analytics.
leila added a subscriber: bd808.Dec 8 2017, 6:39 PM

@bd808 I learned from Nuria that we should have a chat with you regarding dumps and this ticket. So, here I am. :) I'm bringing this up now as I know we're all planning for next quarter, and I don't know how much work this ticket is, etc. but I'd like for us to explore it together to see what can (and cannot) be done on this front. If you need more info from us (Research), just ping.

@leila the cloud-services-team (specifically @madhuvishy) is working on T168486: Migrate customer-facing Dumps endpoints to Cloud Services in concert with @ArielGlenn to migrate the public storage and delivery of Dumps (NFS, rsync, and HTTPS endpoints) to a new set of physical hosts that we will be responsible for operating. Generating Dumps is not currently in scope for our team. We are just going to help with the last mile delivery both to take that burden off of @ArielGlenn and to help ensure that our Cloud-Services users get the best access to Dumps data that we can support. Building and operating a system to generate HTML dumps is very much out of scope for the Cloud Services team.

fdans moved this task from Incoming to Radar on the Analytics board.Dec 11 2017, 4:51 PM
leila added a comment.Dec 11 2017, 5:36 PM

@bd808 thanks for the note. I'll take this offline to discuss who I should reach out to as it's a bit unclear now. :)

leila added a comment.Dec 11 2017, 6:00 PM

@ArielGlenn actually, I overlooked this. Are you the person in charge of this kind of request? :)

This is meant to be (I think) the dumps of content from RESTBase as stored by Parsoid. I know I can't get to this before at least February, and maybe later than that. I have a script that was mostly ready for my end of things; the RESTBase piece needed a repo for deployment with all the dependencies etc., so someone would need to take that part on. I'm not sure who would do that on which team any more :-)

This is meant to be (I think) the dumps of content from RESTBase as stored by Parsoid. I know I can't get to this before at least February, and maybe later than that.

If you can get to it by February, that'd be great. If you can't, it's good to know when we can reasonably expect you to have time for it. We can set expectations with the research community accordingly.

I have a script that was mostly ready for my end of things; the RESTBase piece needed a repo for deployment with all the dependencies etc., so someone would need to take that part on. I'm not sure who would do that on which team any more :-)

@DarTar do you know by any chance which team we should ask for help for this part?
@ArielGlenn who was traditionally responsible for this? Maybe we can start from there and try to figure out which team may be responsible for it now?

leila claimed this task.Jan 6 2018, 7:14 PM

@leila thanks for pushing this forward. I am supportive of this request; I also believe it might deserve a bit of a broader discussion to understand data-consumer needs (rings a bell, @Capt_Swing?)

Context: the main takeaway I got from talking to the DBpedia folks is that the #1 request from their data consumers – other than RDF – is plain text (with all HTML and markup stripped). There's also a longstanding request for dumps in JSON format. In other words, while it's clear that the dumps are primarily consumed by researchers nowadays, I don't have a good sense of what the respective priority should be regarding different productized formats. It sounds like HTML would meet a number of use cases.

Regarding ownership, I'd imagine this work would live at the intersection of Services and Analytics but AFAIK it's unplanned work so it deserves at the very least a detailed discussion within Tech Mgmt.

Nuria added a subscriber: Nuria.Jan 8 2018, 6:52 PM

@DarTar neither the Analytics nor the Cloud Services team would be working on this; see above. This work is on @ArielGlenn's backlog; he is the ops engineer who supports dumps.

Nuria added a subscriber: bmansurov.Jan 8 2018, 8:44 PM

Also, I think once the dumps infrastructure is migrated to cloud ("labs"), @bmansurov can probably team up with @ArielGlenn, and that might be the fastest way to get this done?

leila moved this task from In Progress to Staged on the Research board.Mar 15 2018, 9:00 PM

The migration to cloud is done now, I think.

Sj added a subscriber: Sj.Nov 26 2018, 6:42 PM

See also: https://phabricator.wikimedia.org/T210413 for maintaining an IPFS-friendly dump (which could be generated from an unzipped HTML dump)

leila added a comment.Jan 14 2019, 1:11 PM

Cervisiarius brought this task up last week and mentioned that he and a master's student of his are working on: 1. putting together a system for generating HTML dumps; 2. releasing (as a one-time effort) HTML dumps of the most recent dumps by the time they're done. The latter can unblock/encourage quite a few researchers, while the former can be picked up by WMF and scaled when WMF is ready.

I'm moving this task to In-progress on Research board based on this recent change. I will keep the task assigned to me as Cervisiarius doesn't regularly check Phabricator.

leila moved this task from Staged to In Progress on the Research board.Jan 14 2019, 1:11 PM

Are these full HTML of the pages, or 'just' the parsed/rendered wikitext, or...? And is there a notion of what the code looks like or what components are involved, so we can know whether it can be folded into our infrastructure? Note there is already draft code for parsed/rendered wikitext, as pulled from RESTBase.

In any case it would be great to have a one-off copy.

leila moved this task from In Progress to Staged on the Research board.Apr 1 2019, 11:37 PM

This is such an interesting ticket cluster. Subscribed!

Out of curiosity, what's your use case for them, @EvanProdromou ? Not that I can get back to them any time soon :-(

leila edited projects, added Research-Backlog; removed Research.Jul 11 2019, 3:47 PM

Hi all, we are working on this! We converted the full history of Wikipedia (up to March 2019) from wikitext to HTML and released the dataset on the Internet Archive.

We are finalizing the documentation and planning to publish an article with all the details in the coming weeks. One important aspect of this dataset is that we tried to generate the HTML by reproducing as closely as possible how the page looked when the revision was created: i.e., for each template on the page, we used the version that was available at the time the revision was created.

The full conversion to HTML is quite expensive; it took 42 days on 4 big servers with 48 cores each.

For now, you can find the full archive here: https://archive.org/details/enwiki_history_html
and the repository with some (still partial!) instructions here: https://github.com/epfl-dlab/enwiki_history_to_html

Let me know if you have any suggestions or feedback!

This is great news! We would be happy to link to it and host a copy once it's ready to be announced. What is the cumulative size of the files for download?

It's 7 TB compressed with gzip. The articles are partitioned into JSON files with 1000 revisions each (one per line), and we will share an index (plus a download script) so you can find a specific article.
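To make that layout concrete, one such partition file could be streamed roughly like this (the field names `title`, `revision_id`, and `html` are assumptions for illustration; the actual schema is documented in the repository linked above):

```python
# Hedged sketch: stream one gzipped JSON-lines partition (one revision per line).
# The field names below (title, revision_id, html) are illustrative assumptions;
# the real schema is described in the epfl-dlab repository linked above.
import gzip
import json

def iter_revisions(path):
    """Yield one parsed revision record per line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

if __name__ == "__main__":
    # hypothetical file name, for illustration only
    for rev in iter_revisions("enwiki_history_part_0001.json.gz"):
        print(rev.get("title"), rev.get("revision_id"), len(rev.get("html", "")))
```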

Let me add @Bstorm to make sure she knows I've volunteered us to host a copy and to make sure that there's 7T spare around, since that's more than I expected.

leila added a comment.Dec 3 2019, 3:05 PM

@tizianopiccardi thanks for the update and great to see that you're there. :) Please make sure you advertise for it on wiki-research-l when the right time arrives.

@ArielGlenn storing a copy on our end sounds good. How can we go about keeping the data refreshed on our end? This is a one-time effort by tizianopiccardi and colleagues. Can we take the code from them and have a schedule for releasing HTML dumps the same way we do XML dumps?

@tizianopiccardi thanks for the update and great to see that you're there. :) Please make sure you advertise for it on wiki-research-l when the right time arrives.

@ArielGlenn storing a copy on our end sounds good. How can we go about keeping the data refreshed on our end? This is a one-time effort by tizianopiccardi and colleagues. Can we take the code from them and have a schedule for releasing HTML dumps the same way we do XML dumps?

No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase but that's not a full page view with skin etc. Their code wouldn't help here; we're talking about rendering each page and saving it and that's prohibitively slow. I presume they used MediaWiki to do this, in fact.

Yes, we used MediaWiki. Technically we extracted the function used for the parsing and we intercepted the database calls to return the version of the templates available at the revision creation. In practice, it is as if we called /TitleABC?action=render as soon as the revision was created.

I think that to keep this dataset up-to-date, it would be faster to just set a trigger and store a copy every time there is a new edit.
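As a rough illustration of that trigger idea (not the actual setup), the public EventStreams recentchange feed could drive per-edit re-rendering; note that this ignores the template-edit fan-out problem raised in the reply below:

```python
# Rough illustration only: follow the public EventStreams recentchange feed and
# emit the title of each edited enwiki page, which a consumer would then
# re-render and store. Edits to templates that invalidate other pages are not
# handled here, which is the objection raised in the reply below.
import json
import requests

STREAM = "https://stream.wikimedia.org/v2/stream/recentchange"

def follow_edits(wiki="enwiki"):
    with requests.get(STREAM, stream=True,
                      headers={"Accept": "text/event-stream"}) as resp:
        for raw in resp.iter_lines(decode_unicode=True):
            if raw and raw.startswith("data:"):
                change = json.loads(raw[len("data:"):])
                if change.get("wiki") == wiki and change.get("type") == "edit":
                    yield change["title"]

if __name__ == "__main__":
    for title in follow_edits():
        print("would re-render and store:", title)
```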

leila removed leila as the assignee of this task.Dec 4 2019, 4:17 AM

@tizianopiccardi thanks for the update and great to see that you're there. :) Please make sure you advertise for it on wiki-research-l when the right time arrives.

@ArielGlenn storing a copy on our end sounds good. How can we go about keeping the data refreshed on our end? This is a one-time effort by tizianopiccardi and colleagues. Can we take the code from them and have a schedule for releasing HTML dumps the same way we do XML dumps?

No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase but that's not a full page view with skin etc. Their code wouldn't help here; we're talking about rendering each page and saving it and that's prohibitively slow. I presume they used MediaWiki to do this, in fact.

understood. My updated question is: What does it take to make this task happen? (and please do let me know if you're not the right person for me to pose this question to. :)

...

No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase but that's not a full page view with skin etc.

So I've examined the provided dumps, and I'm wrong about what's in them. They contain the parsed wikitext and nothing else, so this could be obtained from RESTBase.

Yes, we used MediaWiki. Technically we extracted the function used for the parsing and we intercepted the database calls to return the version of the templates available at the revision creation. In practice, it is as if we called /TitleABC?action=render as soon as the revision was created.

I think that to keep this dataset up-to-date, it would be faster to just set a trigger and store a copy every time there is a new edit.

This wouldn't work out, because an edit to a template can trigger changes in the HTML of hundreds of thousands of pages, each of which would need to be re-rendered.

...

understood. My updated question is: What does it take to make this task happen? (and please do let me know if you're not the right person for me to pose this question to. :)

The naive approach is to dump everything in RESTBase for each wiki and publish it, which is what my initial scripts do. There are a few problems with this:

  • we probably can't support queries at the rate required to get dumps out the door in a timely, regular fashion.
  • there is no version or other metadata stored with the HTML that I could keep and compare in order to fetch only the new renders since the last dump.
  • not every page renders, because some pages are just too big; their HTML will not land in RESTBase and will be missing from the dumps.
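For reference, a rough sketch of what such a naive per-page pull could look like (these are not the actual initial scripts; the title source, throttling, and retries are omitted, which is exactly where the issues above bite):

```python
# Hedged sketch of the "naive approach": pull the stored Parsoid HTML for each
# page of a wiki through the public RESTBase HTML endpoint and write it out.
import requests

def dump_wiki_html(domain, titles, out_path):
    """Fetch RESTBase-stored HTML for each title and append it to a flat file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for title in titles:
            url = "https://{}/api/rest_v1/page/html/{}".format(domain, title)
            resp = requests.get(url, headers={"User-Agent": "html-dump-sketch/0.1"})
            if resp.status_code == 200:
                out.write(resp.text + "\n")
            # pages that were never rendered (e.g. too big) simply won't appear,
            # matching the third problem listed above

if __name__ == "__main__":
    dump_wiki_html("en.wikipedia.org",
                   ["Barack_Obama", "Python_(programming_language)"],
                   "enwiki_html_sample.txt")
```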

I'm talking with a few folks now about how we might do something via direct access to Cassandra.

My immediate work is almost always "keep dumps from breaking with MW deploys" and "make them faster because we have more and more content", which means work on this is at a crawl, for which I apologize...

Okay, I have had a chat with one of the core platform folks about the future of RESTBase. TL;DR: it's going away next quarter (Jan-Mar 2020)! It will be replaced by some caching service or other, TBD. I've subscribed to the appropriate ticket (T239743, if no better one comes along), and we'll see what the plans are and whether easy bulk internal access can be negotiated in. Cassandra itself does not lend itself to such things. It's also not even clear whether prerendering on edit will happen in all cases; for example, bots may not need a text preview and may not request the rendered text after an edit, so skipping prerendering in those cases might save load on the servers.

awight added a subscriber: awight.Jan 7 2020, 1:30 PM
awight added a comment.Jan 7 2020, 1:39 PM

I noticed that the criteria listed in the "why?" of this task could also be satisfied by simply expanding the wikitext without fully rendering. For example:

https://en.wikipedia.org/w/api.php?action=expandtemplates&text={{Barack%20Obama}}

There are plenty of reasons one might prefer an HTML rendering: to avoid wikitext syntax entirely, to use off-the-shelf HTML analysis tools, etc. But I thought I would mention the expanded-templates intermediate format in case there are people who would prefer analyzing the wikitext, and also to make the point that the "why?" should probably include a few motivations for specifically wanting HTML.
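For completeness, a hedged sketch of calling that expandtemplates module programmatically, with the same parameters as the URL above:

```python
# Hedged sketch: expand templates in wikitext without rendering it to HTML,
# via the action=expandtemplates module mentioned above.
import requests

API = "https://en.wikipedia.org/w/api.php"

def expand_templates(wikitext):
    """Return the wikitext with all templates expanded (no HTML rendering)."""
    resp = requests.get(API, params={
        "action": "expandtemplates",
        "text": wikitext,
        "prop": "wikitext",
        "format": "json",
    })
    resp.raise_for_status()
    return resp.json()["expandtemplates"]["wikitext"]

if __name__ == "__main__":
    print(expand_templates("{{Barack Obama}}")[:200])
```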

leila added a comment.Jan 8 2020, 11:03 PM

@ArielGlenn thanks for engaging further on this ticket. From the current conversation, it doesn't look to me as if we can have a solution for HTML dumps on a regular basis. I understand the capacity limitations on this front. I'll leave this task open for a future where we may be able to tackle it.

@leila I still really want these to happen. As RESTBase moves towards being phased out, I'm trying to have the discussion about access to its replacement and how we might keep bulk access for dumps in mind. But it's going to need a lot of thought yet.

Aklapper edited projects, added Analytics-Radar; removed Analytics.Jun 10 2020, 6:33 AM