We’d like to create an extension that generates a PDF from a list of articles. This task contains the plan for creating a new HTML->PDF backend for [[ https://www.mediawiki.org/wiki/Extension:Collection | Extension:Collection]]. We’ll use [[https://wkhtmltopdf.org/ | wkhtmltopdf]] to generate the PDFs. Debian has a [[ https://packages.debian.org/jessie/utils/wkhtmltopdf | package]] for it. Below is an initial rough draft.
The main use case for the generated PDF is a printed book. For that, the tool needs to be able to generate a PDF that has a table of contents with page numbers. The [[ https://github.com/msokk/electron-render-service | electron-render-service]] used in [[https://www.mediawiki.org/wiki/Extension:ElectronPdfService | Extension:ElectronPdfService]] does not have this capability. Another use case is reading the generated PDF on a computer, where table of contents entries and links are clickable. The PDF should also have an outline for easy navigation. The electron-render-service doesn't have this capability either. What distinguishes the new extension from the existing [[ https://www.mediawiki.org/wiki/Offline_content_generator | Offline Content Generator service ]] is that the extension will be able to output tables.
The extension will be used as one of the backends for [[https://www.mediawiki.org/wiki/Extension:Collection | Extension:Collection]]. It will expose a couple of endpoints.
One of the endpoints will receive a payload in [[https://www.mediawiki.org/wiki/Offline_content_generator/metabook.json | the metabook format]] and start rendering a PDF. The extension will retrieve HTML versions of the articles from RESTBase. It will also retrieve metadata, such as the authors of images, from the MediaWiki API. It will then transform the HTML pages (the transformations are identified in T163272), or create additional HTML pages such as the cover page, using [[https://www.mediawiki.org/wiki/RemexHtml | RemexHtml]] (as was suggested in T163272#3272877), and save them in the file system. Finally, it will call wkhtmltopdf with the HTML file names as parameters (as shown in T163272#3284896) to generate a PDF. The PDF will be saved in the file system with a unique name.
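One possible way to get a unique PDF name is to hash the metabook payload, so that an identical request maps to the same file. This is only a sketch: the function name `pdf_file_name` and the choice of SHA-256 are assumptions, not something decided in this task, and the payload fields in the usage note are illustrative.

```python
import hashlib
import json


def pdf_file_name(metabook: dict) -> str:
    """Return a deterministic file name for the PDF rendered from a metabook.

    Hypothetical helper: hashing the payload is an assumption, not part of
    the plan above.
    """
    # Serialize with sorted keys and fixed separators so that key order and
    # whitespace in the payload do not change the resulting hash.
    canonical = json.dumps(metabook, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{digest}.pdf"
```

For example, `pdf_file_name({"type": "collection", "title": "Book", "items": []})` returns the same name no matter how the keys are ordered in the request. Note that a name derived only from the payload would not change when the underlying articles are edited, unless the payload pins revisions.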
While we could concatenate the HTML files into one and generate a PDF from that, we don't have to, because `wkhtmltopdf` accepts multiple input pages and generates a single PDF from them. This is especially nice because we won't have to worry about the ID collisions that would happen with concatenated HTML.
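A rough sketch of how the multi-page invocation could be assembled is below. It relies on the `wkhtmltopdf` command-line layout (global options, then `cover`, `toc` and page objects, then the output file). The helper name `build_wkhtmltopdf_command` and the file names are hypothetical, and the exact options used in T163272#3284896 may differ.

```python
from typing import List, Optional


def build_wkhtmltopdf_command(
    pages: List[str], output: str, cover: Optional[str] = None
) -> List[str]:
    """Build an argument list that renders several HTML files into one PDF.

    Hypothetical helper; it only builds the arguments and does not run them.
    """
    if not pages:
        raise ValueError("at least one HTML page is required")
    command = ["wkhtmltopdf", "--outline"]  # --outline puts an outline in the PDF
    if cover is not None:
        command += ["cover", cover]  # cover page object
    command.append("toc")  # generated table of contents object
    command += pages  # each HTML file is passed as its own page object
    command.append(output)  # the output file is always the last argument
    return command
```

The resulting list could be passed to `subprocess.run(command, check=True)`. Passing a list rather than a shell string avoids quoting problems with file names.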
Temporary HTML files created for the purpose of generating the PDF will be immediately removed from the file system as they won’t be needed for creating other PDFs (because it’s unlikely that other books will have the same structure as the one that’s been generated). The PDF however will be kept in the file system for some period so that we can serve it without re-rendering it. Every so often we’ll have to clean up old PDF files. How often?
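The periodic cleanup could look roughly like the sketch below, which deletes PDFs whose modification time is older than a configurable age. The function name `remove_stale_pdfs`, the flat directory layout and the use of mtime as the age criterion are all assumptions. The retention period itself is still the open question above.

```python
import time
from pathlib import Path
from typing import List, Optional


def remove_stale_pdfs(
    directory: str, max_age_seconds: float, now: Optional[float] = None
) -> List[str]:
    """Delete *.pdf files older than max_age_seconds; return the removed names.

    Hypothetical helper: assumes all PDFs live directly in one directory.
    """
    current = time.time() if now is None else now
    removed = []
    for path in Path(directory).glob("*.pdf"):
        # Compare the file's last modification time against the allowed age.
        if current - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Such a function could be run from a cron job or a maintenance script. Only files ending in `.pdf` are touched, so other files in the directory are left alone.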
The extension will also expose an endpoint for retrieving the render status of a collection. This endpoint will be used by [[https://www.mediawiki.org/wiki/Extension:Collection | Extension:Collection]] to periodically check whether the requested PDF is ready.
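As one possible shape for the status check, the sketch below derives the status from the file system: a finished PDF means the render is done, and a marker file means it is still in progress. The status names, the `.lock` marker file and the `render_status` function are all hypothetical. The real endpoint may track jobs differently.

```python
from enum import Enum
from pathlib import Path


class RenderStatus(Enum):
    FINISHED = "finished"
    RENDERING = "rendering"
    UNKNOWN = "unknown"


def render_status(storage_dir: str, collection_id: str) -> RenderStatus:
    """Report the render status of a collection from files on disk.

    Assumes (hypothetically) that the renderer writes <id>.lock while it
    works and <id>.pdf when it is done.
    """
    base = Path(storage_dir)
    if (base / f"{collection_id}.pdf").is_file():
        return RenderStatus.FINISHED
    if (base / f"{collection_id}.lock").is_file():
        return RenderStatus.RENDERING
    return RenderStatus.UNKNOWN
```

Checking for the PDF first means a leftover marker file cannot hide a finished render.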
==Open questions==
[] What details should we clarify before working on the extension? @tgr what do you think?
[] What problems may the above setup cause from the operations perspective? @faidon I'm curious to hear your opinion.
[] Other?