Feature summary:
Wikipedia is available as regular XML dumps of pages, but converting that XML into HTML is currently not trivial. There are two options:
- You import the XML dump into a local MediaWiki install and then point Parsoid to it when rendering pages (or simply save HTML pages as rendered by the MediaWiki install). This requires substantial resources for large Wikipedia instances (e.g., English Wikipedia).
- You direct Parsoid towards the Wikipedia instance for which the XML dump was made (see the sketch after this list). The downside of this approach is that it is slow (you are hitting an API) and puts load on the Wikipedia servers. If you are converting the whole dump, it might even be better (e.g., less load on the Wikipedia servers) to crawl and download rendered Wikipedia pages directly. Another issue is that the rendered pages are no longer really based on the dump (and its snapshot in time) but integrate the latest data (e.g., templates) from the Wikipedia instance itself. So rendering historic dumps might not be possible this way, or at least one would not obtain an exact reproduction.
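For illustration, a minimal sketch of that API-based option, assuming the public Wikimedia REST API HTML endpoint; the user agent, delay, and example titles are placeholders, and the endpoint and rate limits should be checked against current API documentation:

```python
# Sketch of option 2: fetch Parsoid-rendered HTML for page titles from the
# live Wikipedia REST API. This hits live servers, so it is slow and the
# result reflects current templates, not the dump's snapshot in time.
import time
import urllib.parse

import requests

REST_HTML_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/page/html/"

def fetch_parsoid_html(title: str, session: requests.Session) -> str:
    """Fetch Parsoid HTML for a single page title; raises on HTTP errors."""
    url = REST_HTML_ENDPOINT + urllib.parse.quote(title.replace(" ", "_"), safe="")
    response = session.get(url, headers={"User-Agent": "dump-renderer-example/0.1"})
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    session = requests.Session()
    for title in ["Alan Turing", "Wikipedia"]:
        print(title, len(fetch_parsoid_html(title, session)))
        time.sleep(1)  # be polite to the servers when iterating over many titles
```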
So I would propose adding to Parsoid another mode of operation, similar to the mock mode, which would resolve templates and file data against a directory containing templates and file data extracted from the dump (I am assuming both exist in the XML dump). I am not sure what to do about extensions, though. (Nor whether there is anything else that is not available in XML dumps but is needed for proper rendering of the pages themselves from the dump.)
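As a rough sketch of what the dump-side preparation could look like, the following extracts Template-namespace pages from a pages-articles XML dump into a directory of wikitext files; the input path, file naming, and output layout are hypothetical placeholders, not a proposed on-disk format:

```python
# Sketch: pull Template-namespace pages (ns == 10) out of an XML dump so a
# dump-backed rendering mode could resolve transclusions locally.
import os
import xml.etree.ElementTree as ET

def local_name(tag: str) -> str:
    """Strip the XML namespace, which varies between dump schema versions."""
    return tag.rsplit("}", 1)[-1]

def extract_templates(dump_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    for _event, elem in ET.iterparse(dump_path, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(child.tag): child for child in elem}
        ns = fields.get("ns")
        if ns is not None and ns.text == "10":  # 10 is the Template namespace
            title = fields["title"].text
            revision = fields["revision"]
            text = next(c for c in revision if local_name(c.tag) == "text")
            filename = title.replace("/", "_") + ".wikitext"
            with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
                f.write(text.text or "")
        elem.clear()  # keep memory bounded on multi-gigabyte dumps

if __name__ == "__main__":
    extract_templates("enwiki-pages-articles.xml", "templates/")
```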
In addition to the rendering itself, I think it would also be useful to expose some data about the rendered page in JSON format: which templates/extensions/files are used on the rendered page (and which ones were properly processed), and which links it contains (internal and external).
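To illustrate, here is a sketch of how some of that metadata could be derived from Parsoid's HTML output, relying on Parsoid's markup conventions (typeof="mw:Transclusion" for transclusions, rel="mw:WikiLink"/"mw:ExtLink" for internal/external links); the JSON shape is only an example, not a proposed schema:

```python
# Sketch: extract template uses and links from Parsoid HTML and emit JSON.
import json
from html.parser import HTMLParser

class PageMetadataExtractor(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.templates: list[str] = []
        self.internal_links: list[str] = []
        self.external_links: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "mw:Transclusion" in (attrs.get("typeof") or ""):
            # Parsoid stores transclusion details (template targets, params)
            # in the data-mw attribute as JSON.
            data_mw = json.loads(attrs.get("data-mw") or "{}")
            for part in data_mw.get("parts", []):
                if isinstance(part, dict):
                    target = part.get("template", {}).get("target", {})
                    if target.get("wt"):
                        self.templates.append(target["wt"])
        rel = attrs.get("rel") or ""
        if tag == "a" and "mw:WikiLink" in rel:
            self.internal_links.append(attrs.get("href", ""))
        elif tag == "a" and "mw:ExtLink" in rel:
            self.external_links.append(attrs.get("href", ""))

def page_metadata(parsoid_html: str) -> str:
    extractor = PageMetadataExtractor()
    extractor.feed(parsoid_html)
    return json.dumps(
        {
            "templates": extractor.templates,
            "internal_links": extractor.internal_links,
            "external_links": extractor.external_links,
        },
        indent=2,
    )
```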
Use case(s):
Use cases are many: people use Wikipedia for various forms of research, and being able to operate on HTML directly is much easier than having to parse wikitext. Similarly, knowledge extraction and model training are commonly done on Wikipedia content. There are other use cases, like offline access to Wikipedia; such a mode would help efforts like mwoffliner and Kiwix. My personal use case is that I am working on a search engine for Wikipedia, and it is just easier to ingest HTML than wikitext.
Benefits:
Many questions related to Parsoid ask about rendering dumps, so having a working answer would benefit all of them:
- https://lists.wikimedia.org/hyperkitty/list/wikitext-l@lists.wikimedia.org/thread/4EULKH3VW5UR3KGX2HWQLCNLJETYAQEZ/#CPNZT6IPMUNJ3BMQ4NJZW77VKZUFUBCW
- https://www.mediawiki.org/wiki/Topic:W6h810hbhp9nu2ia
- https://www.mediawiki.org/wiki/Topic:W86r33d9uuxsmnlc
Moreover, static HTML dumps have not been running for quite some time now. Such a mode would largely address that, too, providing an alternative.