We’d like to create an extension that generates a PDF from a list of articles. This task contains the plan for creating a new HTML->PDF backend for Extension:Collection. We’ll use wkhtmltopdf to generate the PDFs. Debian has a package for it. Below is an initial rough draft.
The main use case for the generated PDF is a printed book. For that, the tool needs to be able to generate a PDF that has a table of contents with page numbers. The electron-render-service used in Extension:ElectronPdfService does not have this capability. Another use case is reading the generated PDF on a computer, where table of contents entries and links are clickable. The PDF should also have an outline for easy navigation. The electron-render-service doesn't have this capability either. What distinguishes the new extension from the existing Offline Content Generator service is that it will be able to output tables.
Alternatively, we could in theory render a PDF using electron, and then add page numbers and the table of contents with page numbers using another tool. If we go that route we'll still have to depend on another toolkit to do the job. I've looked at Pdftk and it seemed abandoned. The latest version appeared about 4 years ago. Another library I checked out was QPDF, whose latest release (version 6.0.0) was at the end of 2015, although there's been some activity on GitHub since then. On the other hand, the latest stable release of wkhtmltopdf (version 0.12.4) came out at the end of 2016. There may be other tools that we can use, and I'm open to exploring them. However, out of the above 3 tools, wkhtmltopdf is both the most recently released and the easiest to deal with. It's easier because with the other tools we'd have to render with electron first and then apply further transformations to the PDF. I'm not even sure those tools support the requirements we have.
So, the extension will be used as one of the backends for Extension:Collection. It will expose a couple of endpoints.
One of the endpoints will receive a payload in the metabook format and start rendering a PDF. The extension will retrieve HTML versions of the articles from RESTBase, and metadata such as the authors of images from the MediaWiki API. It will then apply the transformations identified in T163272 to the HTML pages, and create additional HTML pages such as the cover page, using RemexHtml (as was suggested in T163272#3272877), and save the results in the file system. Finally, it will call wkhtmltopdf with the HTML file names as parameters (as shown in T163272#3284896) to generate a PDF. The PDF will be saved in the file system with a unique name.
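As a rough illustration of the last step, here is a minimal Python sketch of how the wkhtmltopdf argument list could be assembled from the saved HTML files. The function name and the file names are hypothetical, and the real extension would be written in PHP; the sketch relies only on wkhtmltopdf's documented command shape (optional `cover`, `toc` and `page` objects followed by the output file) and assumes the installed build supports the `toc` object.

```python
from typing import List, Optional


def build_wkhtmltopdf_command(
    page_files: List[str],
    output_pdf: str,
    cover_file: Optional[str] = None,
    binary: str = "wkhtmltopdf",
) -> List[str]:
    """Assemble an argument list: [cover] toc page... output (hypothetical helper)."""
    if not page_files:
        raise ValueError("at least one article HTML file is required")

    command = [binary]
    if cover_file is not None:
        # "cover" marks the file as a cover page object.
        command += ["cover", cover_file]
    # "toc" asks wkhtmltopdf to insert a generated table of contents here.
    command.append("toc")
    for page in page_files:
        # Each article stays a separate input document.
        command += ["page", page]
    # The output file is always the last argument.
    command.append(output_pdf)
    return command
```

The resulting list could be handed to something like `subprocess.run(command, check=True)`; passing a list rather than a shell string avoids quoting problems with article file names.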
While we could concatenate the HTML files into one and generate a PDF from that, we don't have to, as wkhtmltopdf allows us to pass multiple pages and generates a single PDF from them. This is especially nice because we won't have to worry about the ID collisions that would happen with concatenated HTML.
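To make the ID collision concern concrete, the following Python sketch (standard library only) reports `id` values that appear in more than one article. The function and class names are made up for illustration; this is not part of the proposed extension, only a way to see what concatenation would run into.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _IdCollector(HTMLParser):
    """Collects the value of every id attribute seen in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: List[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


def colliding_ids(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each id used by more than one document to the titles that use it."""
    seen: Dict[str, List[str]] = {}
    for title, html in documents.items():
        parser = _IdCollector()
        parser.feed(html)
        parser.close()
        # Use a set so an id repeated inside one document counts once.
        for element_id in set(parser.ids):
            seen.setdefault(element_id, []).append(title)
    return {i: titles for i, titles in seen.items() if len(titles) > 1}
```

Two articles that both contain a section heading with `id="History"` would show up here. In a concatenated file an anchor link to `#History` would be ambiguous, whereas separate input documents keep each article's IDs in its own scope.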
Temporary HTML files created for the purpose of generating the PDF will be immediately removed from the file system as they won’t be needed for creating other PDFs (because it’s unlikely that other books will have the same structure as the one that’s been generated). The PDF however will be kept in the file system for some period so that we can serve it without re-rendering it. Every so often we’ll have to clean up old PDF files. How often?
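One possible shape for the periodic cleanup is sketched below in Python. The retention period is deliberately left as a parameter, since the question of how often to clean up is still open; the function name and the assumption that all rendered PDFs live in one directory with a `.pdf` suffix are illustrative only.

```python
import os
import time
from typing import List, Optional


def remove_stale_pdfs(
    directory: str,
    max_age_seconds: float,
    now: Optional[float] = None,
) -> List[str]:
    """Delete .pdf files whose modification time is older than max_age_seconds.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    # Read the listing first so we are not deleting while iterating.
    with os.scandir(directory) as entries:
        candidates = [e for e in entries if e.is_file() and e.name.endswith(".pdf")]

    removed = []
    for entry in candidates:
        if now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Something like this could run from a cron job or a maintenance script. Using the modification time means a PDF is kept for a fixed period after it was rendered, regardless of how often it is downloaded.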
The extension will also expose an endpoint for retrieving the render status of a collection. Extension:Collection will use this endpoint to periodically check whether the requested PDF is ready.
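The periodic check could look roughly like the Python sketch below. The response shape (a dict with a `state` key whose values include `finished` and `failed`) is purely an assumption for illustration, as the status format has not been defined in this plan; `get_status` stands in for whatever call Extension:Collection would make to the status endpoint.

```python
import time
from typing import Callable, Optional


def wait_for_render(
    get_status: Callable[[], dict],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Poll get_status until the render reaches a terminal state.

    Returns the final status dict, or None if max_attempts is exhausted.
    The "state" key and its values are hypothetical.
    """
    for attempt in range(max_attempts):
        status = get_status()
        if status.get("state") in ("finished", "failed"):
            return status
        # Do not sleep after the last attempt.
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return None
```

Capping the number of attempts keeps a stuck render from being polled forever; the caller can then decide whether to report a timeout or trigger a re-render.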