
Make HTML dumps available
Closed, ResolvedPublic

Description

Make HTML dumps of Wikipedia available, similar to the existing wikitext dumps.

Why?

  • Templates: The HTML version has all templates expanded, while wikitext doesn't, and as far as we know there is no easy, standard way of expanding templates locally without installing a full MediaWiki stack.
  • Frequency: We (researchers inside and outside of WMF) often need access to this data.
  • Load: At the moment, we are all hitting the API to get this data.
  • Efficiency: Hitting the API to fetch the data is not efficient; it takes many hours to retrieve the full HTML of a project such as enwiki (see the sketch below).
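
For a sense of what "hitting the API" means in practice, here is a minimal sketch of the per-page workaround, assuming the public REST `page/html` endpoint; the script name, User-Agent string, and titles are illustrative only:

```python
import time
import requests

# Rough sketch of the current workaround: fetch rendered (Parsoid) HTML one
# page at a time from the public REST API. Doing this for every page of
# enwiki is what takes many hours.
REST_HTML = "https://en.wikipedia.org/api/rest_v1/page/html/{title}"
HEADERS = {"User-Agent": "html-dump-research-sketch/0.1 (researcher@example.org)"}

def fetch_html(title: str) -> str:
    """Return the rendered HTML of the current revision of `title`."""
    resp = requests.get(REST_HTML.format(title=title), headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    for title in ["Barack_Obama", "Wikipedia"]:
        print(title, len(fetch_html(title)), "bytes of HTML")
        time.sleep(0.1)  # be polite; the public endpoint is rate limited
```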

Some recent applications
A variety of research needs this kind of data. To give you a sense, see these two recent publications for which we had to use the API to fetch the HTML:

  • Dimitrov, Dimitar, et al. "What Makes a Link Successful on Wikipedia?." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.
  • Singer, Philipp, et al. "Why We Read Wikipedia." Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2017.

Event Timeline


@bd808 I learned from Nuria that we should have a chat with you regarding dumps and this ticket. So, here I am. :) I'm bringing this up now as I know we're all planning for next quarter, and I don't know how much work this ticket is, etc. but I'd like for us to explore it together to see what can (and cannot) be done on this front. If you need more info from us (Research), just ping.

@leila the cloud-services-team (specifically @madhuvishy) is working on T168486: Migrate customer-facing Dumps endpoints to Cloud Services in concert with @ArielGlenn to migrate the public storage and delivery of Dumps (NFS, rsync, and HTTPS endpoints) to a new set of physical hosts that we will be responsible for operating. Generating Dumps is not currently in scope for our team. We are just going to help with the last mile delivery both to take that burden off of @ArielGlenn and to help ensure that our Cloud-Services users get the best access to Dumps data that we can support. Building and operating a system to generate HTML dumps is very much out of scope for the Cloud Services team.

@bd808 thanks for the note. I'll take this offline to discuss who I should reach out to as it's a bit unclear now. :)

@ArielGlenn actually, I overlooked this. Are you the person in charge of this kind of request? :)

This is meant to be (I think) the dumps of content from RESTBase as stored by parsoid. I know I can't get to this before at least February and maybe later than that. I have a script that was mostly ready for my end of things, the restbase piece needed a repo for deployment with all the dependencies etc, so someone would need to take that part on; I'm not sure who would do that on which team any more :-)

This is meant to be (I think) the dumps of content from RESTBase as stored by parsoid. I know I can't get to this before at least February and maybe later than that.

If you can get to it by February, that'd be great. If you can't, it's good to know when it's reasonable to expect for you to have time to get to this. We can set expectations accordingly with the research community.

I have a script that was mostly ready for my end of things, the restbase piece needed a repo for deployment with all the dependencies etc, so someone would need to take that part on; I'm not sure who would do that on which team any more :-)

@DarTar do you know by any chance which team we should ask for help for this part?
@ArielGlenn who was traditionally responsible for this? Maybe we can start from there and try to figure out which team may be responsible for it now?

@leila thanks for pushing this forward. I am supportive of this request; I also believe this might deserve a bit of a broader discussion to understand data consumer needs (rings a bell, @Capt_Swing ?)

Context: the main takeaway I got from talking to the DBpedia folks is that the #1 request from their data consumers – other than RDF – is plain text (with all HTML and markup stripped). There's also a longstanding request for dumps in JSON format. In other words, while it's clear that the dumps are primarily consumed by researchers nowadays, I don't have a good sense of what the respective priority should be regarding different productized formats. It sounds like HTML would meet a number of use cases.

Regarding ownership, I'd imagine this work would live at the intersection of Services and Analytics but AFAIK it's unplanned work so it deserves at the very least a detailed discussion within Tech Mgmt.

@DarTar neither analytics nor cloud team would be working on this, see above. This work is on @ArielGlenn 's backlog who is the ops engineer that supports dumps.

Also, I think once dumps infrastructure is migrated to cloud "labs" probably @bmansurov can team up with @ArielGlenn and that might be the fastest way to get this done?

The migration to cloud is done now, I think.

See also: https://phabricator.wikimedia.org/T210413 for maintaining an IPFS-friendly dump (which could be generated from an unzipped HTML dump)

Cervisiarius brought this task up last week and mentioned that he and one of his master's students are working on: 1. putting together a system for generating HTML dumps; 2. releasing (as a one-time effort) HTML dumps based on the most recent dumps by the time they're done. The latter can unblock/encourage quite a few researchers, while the former can be picked up by WMF and scaled when WMF is ready.

I'm moving this task to In-progress on Research board based on this recent change. I will keep the task assigned to me as Cervisiarius doesn't regularly check Phabricator.

Are these the full HTML of the pages, or 'just' of the parsed/rendered wikitext, or...? And, is there a notion of what the code looks like or what components are involved, so we can know if it can be folded into our infrastructure? Note there is draft code already for parsed/rendered wikitext, as pulled from RESTBase.

In any case it would be great to have a one-off copy.

This is such an interesting ticket cluster. Subscribed!

Out of curiosity, what's your use case for them, @EvanProdromou ? Not that I can get back to them any time soon :-(

Hi all, we are working on this! We converted the full history of Wikipedia (up to March 2019) from Wikitext to HTML and we released the dataset on Internet Archive.

We are finalizing the documentation and planning to publish an article with all the details in the coming weeks. One important aspect of this dataset is that we tried to generate the HTML by reproducing as much as possible how the page looked when the revision was created: i.e. for each template existing on the page, we used the version available at the time the revision was created.

The full conversion to HTML is quite expensive: it took 42 days on 4 big servers with 48 cores each.

For now, you can find the full archive here: https://archive.org/details/enwiki_history_html
and the repository with some (still partial!) instructions here: https://github.com/epfl-dlab/enwiki_history_to_html

Let me know if you have any suggestions or feedback!

This is great news! We would be happy to link to it and host a copy once it's ready to be announced. What is the cumulative size of the files for download?

It's 7 TB compressed with gzip. The articles are partitioned into JSON files with 1000 revisions each (one revision per line), and we will share an index (plus a download script) for locating a specific article.
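
For illustration, consuming that layout could look roughly like the sketch below; the file name and the per-revision field names (`title`, `revid`, `html`) are placeholders rather than the dataset's documented schema, so check the repository's instructions for the real ones:

```python
import gzip
import json

# Minimal sketch: stream one gzipped JSON-lines partition (one revision per
# line). The field names used below are hypothetical placeholders; the actual
# schema is documented in the dataset repository.
def iter_revisions(path: str):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)

if __name__ == "__main__":
    for rev in iter_revisions("enwiki_part_0001.json.gz"):  # hypothetical file name
        print(rev.get("title"), rev.get("revid"), len(rev.get("html", "")))
```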

Let me add @Bstorm to make sure she knows I've volunteered us to host a copy and to make sure that there's 7T spare around, since that's more than I expected.

@tizianopiccardi thanks for the update and great to see that you're there. :) Please make sure you advertise for it on wiki-research-l when the right time arrives.

@ArielGlenn storing a copy on our end sounds good. How can we go about keeping the data refreshed on our end? This is a one-time effort by tizianopiccardi and colleagues. Can we take the code from them and have a schedule for releasing HTML dumps the same way we do XML dumps?

@tizianopiccardi thanks for the update and great to see that you're there. :) Please make sure you advertise for it on wiki-research-l when the right time arrives.

@ArielGlenn storing a copy on our end sounds good. How can we go about keeping the data refreshed on our end? This is a one-time effort by tizianopiccardi and colleagues. Can we take the code from them and have a schedule for releasing HTML dumps the same way we do XML dumps?

No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase but that's not a full page view with skin etc. Their code wouldn't help here; we're talking about rendering each page and saving it and that's prohibitively slow. I presume they used MediaWiki to do this, in fact.

Yes, we used MediaWiki. Technically we extracted the function used for the parsing and we intercepted the database calls to return the version of the templates available at the revision creation. In practice, it is as if we called /TitleABC?action=render as soon as the revision was created.

I think that to keep this dataset up-to-date, it would be faster to just set a trigger and store a copy every time there is a new edit.

leila removed leila as the assignee of this task. Dec 4 2019, 4:17 AM

@tizianopiccardi thanks for the update and great to see that you're there. :) Please make sure you advertise for it on wiki-research-l when the right time arrives.

@ArielGlenn storing a copy on our end sounds good. How can we go about keeping the data refreshed on our end? This is a one-time effort by tizianopiccardi and colleagues. Can we take the code from them and have a schedule for releasing HTML dumps the same way we do XML dumps?

No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase but that's not a full page view with skin etc. Their code wouldn't help here; we're talking about rendering each page and saving it and that's prohibitively slow. I presume they used MediaWiki to do this, in fact.

understood. My updated question is: What does it take to make this task happen? (and please do let me know if you're not the right person for me to pose this question to. :)

...

No, there is a ticket for dumping parsed wikitext as it is stored in RESTBase but that's not a full page view with skin etc.

So I've examined the provided dumps and I'm wrong about what's in them. They contain the parsed wikitext and nothing else, so this could be gotten from RESTBase.

Yes, we used MediaWiki. Technically we extracted the function used for the parsing and we intercepted the database calls to return the version of the templates available at the revision creation. In practice, it is as if we called /TitleABC?action=render as soon as the revision was created.

I think that to keep this dataset up-to-date, it would be faster to just set a trigger and store a copy every time there is a new edit.

This wouldn't work out, because an edit to a template can trigger changes in the HTML to hundreds of thousands of pages, each of which would need to be re-rendered.

...

understood. My updated question is: What does it take to make this task happen? (and please do let me know if you're not the right person for me to pose this question to. :)

The naive approach is to dump everything in RESTBase for each wiki and publish it, which is what my initial scripts do (a rough sketch of that pattern follows the list below). There are a few problems with this:

  • we probably can't support queries at the rate required to get dumps out the door in a timely, regular fashion.
  • there is no version or other metadata stored with the HTML that I could keep and compare in order to fetch only new renders since the last dump.
  • not every page renders, because some pages are just too big; their HTML will not land in RESTBase and will be missing from the dumps.
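
To make that concrete, the pattern looks roughly like the sketch below; this is an illustration against the public endpoints, not the actual internal scripts, and the problems listed above (query rate, no change metadata, missing oversized pages) apply to it directly:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
REST_HTML = "https://en.wikipedia.org/api/rest_v1/page/html/{title}"
HEADERS = {"User-Agent": "naive-html-dump-sketch/0.1 (researcher@example.org)"}

def iter_all_titles():
    """Yield every main-namespace title via the action API's list=allpages."""
    params = {"action": "query", "list": "allpages",
              "aplimit": "max", "format": "json"}
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            return
        params.update(data["continue"])

def dump_some_html(out_path: str, limit: int = 100):
    """Fetch current-revision HTML for the first `limit` pages (sketch only)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for i, title in enumerate(iter_all_titles()):
            if i >= limit:
                break
            resp = requests.get(REST_HTML.format(title=title.replace(" ", "_")),
                                headers=HEADERS, timeout=30)
            if resp.ok:  # oversized pages may have no stored render
                out.write(resp.text)

if __name__ == "__main__":
    dump_some_html("enwiki_html_sample.html")
```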

I'm talking with a few folks now about how we might do something via direct access to Cassandra.

My immediate work is almost always "keep dumps from breaking with MW deploys" and "make them faster because we have more and more content", which means work on this is at a crawl, for which I apologize...

Okay, I have had a chat with one of the core platform folks about the future of RESTBase. TL;DR: it's going away next quarter (Jan-Mar 2020)! It will be replaced by some caching service or other, TBD. I've subscribed to the appropriate ticket (T239743 if no better one comes along), and we'll see what the plans are and whether easy bulk internal access can be negotiated in. Cassandra itself does not lend itself to such things. It's also not even clear whether prerendering on edit will happen in all cases; for example, bots may not need a text preview and may not request the rendered text after edit, so skipping prerendering in these cases might save load on the servers.

I noticed that the criteria listed in the "why?" of this task could also be satisfied by simply expanding the wikitext without fully rendering. For example:

https://en.wikipedia.org/w/api.php?action=expandtemplates&text={{Barack%20Obama}}

There are plenty of reasons one might prefer an HTML rendering: to avoid wikitext syntax entirely, to use off-the-shelf HTML analysis tools, etc. But I thought I would mention the expanded-templates intermediate format in case there are people who would prefer analyzing the wikitext, and also to make the point that the "why?" should probably include a few motivations for specifically desiring HTML.
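
For anyone who wants that expanded-wikitext intermediate form programmatically, a minimal sketch against the action API's action=expandtemplates module (the User-Agent string is a placeholder):

```python
import requests

# Minimal sketch: expand templates in a snippet of wikitext via the action
# API, returning expanded wikitext rather than rendered HTML.
API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "expandtemplates-sketch/0.1 (researcher@example.org)"}

def expand(wikitext: str) -> str:
    """Return `wikitext` with all templates expanded (no HTML rendering)."""
    params = {
        "action": "expandtemplates",
        "text": wikitext,
        "prop": "wikitext",
        "format": "json",
    }
    resp = requests.get(API, params=params, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.json()["expandtemplates"]["wikitext"]

if __name__ == "__main__":
    print(expand("{{Barack Obama}}")[:500])
```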

@ArielGlenn thanks for engaging on this ticket further. From the current conversation, it doesn't look to me as if we can have a solution for producing HTML dumps on a regular basis. I understand the capacity limitations on this front. I'll leave this task open for a future where we may be able to tackle it.

@leila I still really want these to happen. As RESTBase moves towards being phased out, I'm trying to have the discussion about access to its replacement and how we might keep bulk access for dumps in mind. But it's going to need a lot of thought yet.

For context: I asked @fkaelin to pick this task up as it remains a high priority request from the research community to our team and we seem to have enough in place to be able to make an MVP release happen.

@leila for clarification, do you need HTML of all revisions of all pages, or only the current version of page content?

@ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the revision was created. The motivation for this is described in @tizianopiccardi's paper.

The approach that we discussed, which led to revisiting this task, is the following (a rough sketch of the streaming step follows the list):

  • deploy a streaming pipeline on the analytics infrastructure which
    • consumes from the mediawiki.revision-create Kafka topic
    • retrieves the rendered HTML for that revision from a service (TBD: the API, RESTBase, or a caching layer)
    • stores the HTML in a schema along with other fields, i.e. like the mediawiki history but with HTML instead of wikitext
  • data is stored in HDFS using a directory structure like /wiki_db=enwiki/year=2020/month=12/day=24/hour=15/***.avro, i.e. not snapshot-based
  • once this pipeline is in production, we backfill historical data using the existing WikiHist dataset and tool
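
As a rough sketch of the streaming step only (not the full pipeline): the broker address, topic name, and event field names below are assumptions based on the public event schemas, and the REST `page/html/{title}/{revision}` endpoint is just one of the candidate services mentioned above:

```python
import json
import requests
from kafka import KafkaConsumer  # pip install kafka-python

# Minimal sketch of the proposed streaming step: consume revision-create
# events and fetch the rendered HTML for each new revision. Broker address,
# topic name, and event field names are assumptions for illustration.
consumer = KafkaConsumer(
    "eqiad.mediawiki.revision-create",
    bootstrap_servers=["kafka-jumbo1001.eqiad.wmnet:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

HEADERS = {"User-Agent": "html-dump-pipeline-sketch/0.1 (researcher@example.org)"}

def fetch_revision_html(domain: str, title: str, rev_id: int) -> str:
    """Fetch the rendered HTML of one specific revision from the REST API."""
    url = f"https://{domain}/api/rest_v1/page/html/{title}/{rev_id}"
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return resp.text

for message in consumer:
    event = message.value
    html = fetch_revision_html(
        event["meta"]["domain"], event["page_title"], event["rev_id"]
    )
    # In the real pipeline this record would be written to HDFS as Avro,
    # partitioned by wiki_db/year/month/day/hour as described above.
    print(event["database"], event["rev_id"], len(html))
```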

Caveats:

  • The published WikiHist dataset covers the March 2019 snapshot of the English Wikipedia. For the period from March 2019 until the production pipeline is deployed, as well as for all the other Wikipedias we want to include, the backfill needs to use the WikiHist tool. The engineering effort required for this has not been estimated.
  • The revision-create event stream is currently not guaranteed to be equivalent to the mediawiki history; some data is missing.
  • How to make this data available publicly? While the data is public at the time of creation, i.e. we could theoretically publish it hourly, a mechanism to deal with deleted/hidden content is required. I have to learn more about how we do this for the wikitext snapshot dumps; I hope we can replicate the approach. Related: T262479
  • There are other initiatives, e.g. Okapi, that aim to do something similar; more coordination is warranted.

@ArielGlenn, the dataset should contain the rendered html for all revisions, rendered with the mediawiki version at the time the revision was created. The motivation for this is described in @tizianopiccardi's paper.

In that case the OKAPI dumps will not be helpful, as they contain the current revision only.

The approach that we discussed, which led to revisiting this task, is the following:

  • deploy a streaming pipeline on the analytics infrastructure which
    • consumes from the mediawiki.revision-create Kafka topic
    • retrieves the rendered HTML for that revision from a service (TBD: the API, RESTBase, or a caching layer)
    • stores the HTML in a schema along with other fields, i.e. like the mediawiki history but with HTML instead of wikitext
  • data is stored in HDFS using a directory structure like /wiki_db=enwiki/year=2020/month=12/day=24/hour=15/***.avro, i.e. not snapshot-based
  • once this pipeline is in production, we backfill historical data using the existing WikiHist dataset and tool

Note that as templates and Lua modules are edited, the HTML of the pages that use them changes, which means the HTML of older revisions of those pages changes as well. I don't have a good proposal for how to address that, only a bad one: reparse the old pages as needed.

Caveats:

  • The published WikiHist dataset covers the March 2019 snapshot of the English Wikipedia. For the period from March 2019 until the production pipeline is deployed, as well as for all the other Wikipedias we want to include, the backfill needs to use the WikiHist tool. The engineering effort required for this has not been estimated.
  • The revision-create event stream is currently not guaranteed to be equivalent to the mediawiki history; some data is missing.
  • How to make this data available publicly? While the data is public at the time of creation, i.e. we could theoretically publish it hourly, a mechanism to deal with deleted/hidden content is required. I have to learn more about how we do this for the wikitext snapshot dumps; I hope we can replicate the approach. Related: T262479

The way we manage this for the wikitext XML/SQL dumps is by regenerating the metadata for all visible pages and revisions at the time, and copying only the old revision data that corresponds to those visible pages and revisions.

  • There are other initiatives, e.g. Okapi, that aim to do something similar; more coordination is warranted.

Perhaps they would be interested in providing full revision dumps in HTML but that would likely be a long ways off. Their remit is somewhat different from ours/yours.

Hi @fkaelin - it's nice to meet you; it sounds like there are a lot of overlaps in your thinking and ours. On Okapi, in general, we are working on some things that may be relevant as well as others that may not be.

I can set up some time to discuss your initiative and see how we can help. To @ArielGlenn's point, we are focused on current revisions; however, we currently create a dump every day in HTML, wikitext, and HTML/wikitext (JSON) [for every text-based project and language] that will be hosted on their dumps page bi-weekly. We're in the early stages of scoping that out. Thanks @bd808 for adding this ticket tracking that.

To summarize my understanding:

  • for research, the HTML history is interesting because it has templates and Lua modules expanded
  • for a revision of page p created at time t, we prefer to store the HTML that a reader was served at that time (i.e. what WikiHist does), rather than the HTML generated using the version of the templates at some point in the future (i.e. by calling the MediaWiki API during a batch export)
  • however, if at time t+1 a template that is used by page p changes, then readers are served different HTML on Wikipedia, but there is no new revision of page p. Only once there is a new revision of page p at time t+2 will the change to the template at time t+1 be reflected in the history. In fact, if page p is never edited after time t, the template change will never be reflected in the HTML history of the page.

In the streaming mode, the HTML generated by the MediaWiki API uses the current version of each template, eliminating the problem of which template version to use. This leaves the problem of how to handle template updates, which indeed seems quite daunting. It would be interesting to get some statistics about how many pages are affected by template updates, a histogram of how long pages 'spend' being updated via templates before being edited again directly, etc. Is there any previous work in this direction?
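
As a possible starting point for such statistics, here is a minimal sketch that counts how many pages currently transclude a given template, using the action API's list=embeddedin; it only measures the blast radius of a single template today, not the temporal histogram described above:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "template-impact-sketch/0.1 (researcher@example.org)"}

def count_transclusions(template: str) -> int:
    """Count pages that transclude `template` (e.g. 'Template:Citation needed')."""
    count = 0
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "eilimit": "max",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
        count += len(data["query"]["embeddedin"])
        if "continue" not in data:
            return count
        params.update(data["continue"])  # standard API continuation

if __name__ == "__main__":
    print(count_transclusions("Template:Citation needed"))
```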

@RBrounley_WMF, nice to meet you too. This task does seem very related to what Okapi is looking to do, thank you for setting up a conversation.

Templates are stored in wikitext (right...are they?). If so, I wonder if mediawiki history would help you here. I know that @JAllemandou has talked about wanting to attach revision content to MW history too, but I'm not sure how viable that is. Perhaps we could attach html too...and maybe even parse it using template content from the history?? Ok yes that does sound daunting :)

Templates are stored in wikitext (right...are they?). If so, I wonder if mediawiki history would help you here. I know that @JAllemandou has talked about wanting to attach revision content to MW history too, but I'm not sure how viable that is. Perhaps we could attach html too...and maybe even parse it using template content from the history?? Ok yes that does sound daunting :)

Basically what you need, all the way down the graph, is additional metadata recording the rev_id of each transcluded item. In theory this is something MediaWiki could start tracking, possibly using a "slot" for storage. BUT the spiral goes further down, because you would also need to version any Scribunto modules that were used. And if the desired end goal includes visual parity, you probably also have to factor in tracking of site.js and site.css.

If you're going to do site JS/CSS then you might as well do Vector JS/CSS as well.
This may be going too far, but some other things to consider:

  • Vector wasn't always the default, so also JS/CSS for earlier defaults
  • past versions of wmf-config
  • past MediaWiki wmf deployment branches

Even then you can't get everything; there was a time before phase3.

a couple more things

  • magic words that return date/time should return an old date/time
  • make sure that articles that exist now are red links if they didn't exist back then

This has been requested by the kiwix team multiple times over the years. Hopefully this would be parsoid-format HTML dumps.

Are the Enterprise dumps enough for Kiwix? @leila, what else is needed?

@fkaelin: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!

@fkaelin as mentioned by cscott earlier, Enterprise now publishes HTML dumps and we have a parser to process it (blog post). The initial intention that I had when creating this task is resolved by these two changes. I'd like to resolve this task for that reason. Am I missing something? (fwiw: I understand work is never done on this front. For example, we ideally should have HTML dumps in Wikimedia Foundation's data infrastructure. That needs a separate task imo.)

I agree @leila, we can close this as resolved.

There are already follow-up tasks for historical dumps (T333419) and making them available on DE infra (T305688)