
Support for rendering pages as stored in Wikipedia dumps to HTML
Open, Needs Triage, Public, Feature

Description

Feature summary:

Wikipedia is available as regular XML dumps of pages, but converting that XML into HTML is currently not trivial. There are currently two options:

  • You import the XML dump into a local MediaWiki install and then point Parsoid at it when rendering pages (or simply save the HTML pages as rendered by the MediaWiki install). This requires substantial resources for large Wikipedia instances (e.g., English Wikipedia).
  • You point Parsoid at the Wikipedia instance for which the XML dump was made. The downside of this approach is that it is slow (you are hitting an API) and puts load on the Wikipedia servers. If you are converting a whole dump, it might even be better (e.g., less load on the Wikipedia servers) to crawl and download the rendered Wikipedia pages directly, as in the sketch after this list. Another issue is that the rendered pages are then not really based on the dump anymore (and its snapshot in time) but integrate the latest data (e.g., templates) from the Wikipedia instance itself. So rendering historic dumps might not be possible this way, or at least one would not obtain an exact reproduction.
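
For illustration, crawling the already-rendered Parsoid HTML could look roughly like the sketch below, using the public REST endpoint /api/rest_v1/page/html/{title}. The titles, delay, and User-Agent string are placeholders, and the HTML returned reflects the current revision rather than the dump's snapshot:

```python
import time
import urllib.parse
import urllib.request

def fetch_parsoid_html(title, domain="en.wikipedia.org"):
    # The REST endpoint returns Parsoid HTML for the *current* revision,
    # not the revision that happened to be in a given XML dump.
    safe_title = urllib.parse.quote(title.replace(" ", "_"), safe="")
    url = f"https://{domain}/api/rest_v1/page/html/{safe_title}"
    req = urllib.request.Request(
        url, headers={"User-Agent": "dump-rendering-experiment/0.0 (placeholder contact)"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

for title in ["Albert Einstein", "Wikipedia"]:
    html = fetch_parsoid_html(title)
    print(title, len(html))
    time.sleep(1)  # arbitrary delay to keep load on the servers low
```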

So I would propose that we add another mode of operation to Parsoid, similar to mock, which would resolve templates and file data against a directory containing the templates and file data extracted from the dump (I am assuming both exist in the XML dump). I am not sure what to do about extensions, though. (Or about anything else which is not available in XML dumps but is needed for proper rendering of the pages themselves from the dump.)
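
To make the idea concrete, here is a minimal sketch (my own invention, not an existing Parsoid feature) of extracting everything in the Template: namespace (ns 10) from a pages-articles dump into a per-template directory that such an offline mode could resolve transclusions against. The file layout and naming are arbitrary assumptions:

```python
import bz2
import pathlib
import xml.etree.ElementTree as ET

def local(tag):
    # Dump elements carry an export-schema XML namespace; strip it.
    return tag.rsplit("}", 1)[-1]

def extract_templates(dump_path, out_dir):
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with bz2.open(dump_path, "rb") as f:
        for _, elem in ET.iterparse(f):
            if local(elem.tag) != "page":
                continue
            fields = {local(child.tag): child for child in elem}
            if fields.get("ns") is not None and fields["ns"].text == "10":  # Template: namespace
                name = fields["title"].text.split(":", 1)[1]
                wikitext = ""
                for child in fields["revision"]:
                    if local(child.tag) == "text":
                        wikitext = child.text or ""
                # One file per template; a real store would have to handle
                # subpage slashes and title normalization more carefully.
                path = out / (name.replace("/", "%2F") + ".wikitext")
                path.write_text(wikitext, encoding="utf-8")
            elem.clear()  # keep memory use roughly bounded on large dumps

extract_templates("enwiki-latest-pages-articles.xml.bz2", "templates/")
```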

In addition to the rendering itself, I think it would also be useful to expose some data about the rendered page in JSON format: which templates/extensions/files are used on the rendered page (and which ones were properly processed), and which links are there (internal, external).
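
As a strawman for that JSON, here is a sketch that derives such metadata from Parsoid's existing HTML output, relying on the markup Parsoid already emits (typeof="mw:Transclusion" with data-mw for templates, rel="mw:WikiLink"/"mw:ExtLink" for links). The output shape is just what I have in mind, not any existing format:

```python
import json
from html.parser import HTMLParser

class PageMetadata(HTMLParser):
    """Collect templates and links from a single page of Parsoid HTML."""

    def __init__(self):
        super().__init__()
        self.templates = set()
        self.internal_links = set()
        self.external_links = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "mw:Transclusion" in (a.get("typeof") or ""):
            # data-mw holds the transclusion info, including template targets.
            data_mw = json.loads(a.get("data-mw") or "{}")
            for part in data_mw.get("parts", []):
                if isinstance(part, dict):
                    target = part.get("template", {}).get("target", {})
                    if "wt" in target:
                        self.templates.add(target["wt"].strip())
        rel = a.get("rel") or ""
        if "mw:WikiLink" in rel:
            self.internal_links.add(a.get("href", ""))
        elif "mw:ExtLink" in rel:
            self.external_links.add(a.get("href", ""))

def summarize(parsoid_html):
    p = PageMetadata()
    p.feed(parsoid_html)
    return {
        "templates": sorted(p.templates),
        "internal_links": sorted(p.internal_links),
        "external_links": sorted(p.external_links),
    }
```

Feeding a page's Parsoid HTML to summarize() and serializing the result with json.dumps() would give one JSON record per article; the "extensions" and "properly processed" parts would need additional markers.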

Use case(s):

Use cases are many: people use Wikipedia for various forms of research, and being able to operate on HTML directly is much easier than having to parse wikitext. Similarly, knowledge extraction and model training are done a lot on Wikipedia content. There are other use cases, like offline access to Wikipedia; such a mode would help efforts like mwoffliner and Kiwix. My personal use case is that I am working on a search engine for Wikipedia, and it is simply easier to ingest HTML than wikitext.

Benefits:

Many questions related to Parsoid ask about rendering dumps, so having a working answer to that would benefit all of them.

Moreover, static HTML dumps have not been running for quite some time now. Such a mode would largely address that, too, by providing an alternative.

Event Timeline

I'm not sure what exactly you mean by "file data", but if you're talking about the actual uploaded files being embedded in rendered pages, note that these are not included in the XML dumps (otherwise the dump files would be much, much larger than they are now). Currently, dumps of media files are not publicly available the way dumps of page contents are (T298394), though the first step towards that has been taken with the resolution of T262668. Note, however, that even when that gets resolved, there still won't be dumps corresponding to the state of uploaded files at the time of old page content dumps, so in practice it will never be possible to accurately reconstruct old articles purely from dump files with certainty (ignoring, of course, differences in parser behavior between when the dump was made and today).

> I'm not sure what exactly you mean by "file data"

No, I thought more about the data needed to render the <img> tag itself, like image size, alt text, and the URL of the file on Commons. My understanding is that Parsoid contacts the MediaWiki API to obtain those file properties when rendering wikitext. And I am assuming that those properties are available in the XML dump? So not the media files themselves, but the properties necessary to render wikitext referencing those files. Is that available?
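
To be concrete, by "file data" I mean the kind of properties the standard imageinfo API returns (width, height, URL, MIME type). Something like the following, though I am not claiming this is the exact request Parsoid issues:

```python
import json
import urllib.parse
import urllib.request

def image_info(file_title, api="https://commons.wikimedia.org/w/api.php"):
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": file_title,        # e.g. "File:Example.jpg"
        "prop": "imageinfo",
        "iiprop": "url|size|mime",
        "format": "json",
    })
    req = urllib.request.Request(
        f"{api}?{params}",
        headers={"User-Agent": "dump-rendering-experiment/0.0 (placeholder contact)"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    page = next(iter(data["query"]["pages"].values()))
    return page.get("imageinfo", [{}])[0]  # width, height, url, mime, ...

print(image_info("File:Example.jpg"))
```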

> so in practice it will never be possible to accurately reconstruct old articles purely from dump files.

But we can get closer, I hope. My motivation is primarily the textual content for now. Media files are really the next level and, as you noted, they are missing dependencies before one could even start working on them.

So my question for now really is: does the XML dump itself contain everything Parsoid requires to render the pages inside it, in the same way as it would if it contacted the API? My understanding is that only extensions are missing, but the rest is in there?

I learned only later that there are now Enterprise HTML dumps, which contain exactly what I imagined I would like to generate from XML dumps: JSON, HTML, and metadata about the templates used. Very cool.

I still think it would be nice if one were able to generate that themselves, offline. But that could be the target: to be able to obtain the same JSON for each article in the XML dump, if possible.