Investigate what is needed to use browser based rendering for books
Closed, Resolved (Public)

Description

Task:
Currently, books that are printed as PDF contain not only a collection of articles, but also a dedicated title page, a table of contents, and the citations and authors for all articles at the end of the collection.
When using a browser-based rendering service such as Electron, what would we need to do to keep supporting these features? In particular:

To what extent would it be possible to add a title page and a table of contents similar to the way they are added now?
How can we remove the references part of the individual articles?
Is it possible to add the references of all articles at the end similarly to how it is now done for the latex pdfs?
Is it possible to add the authors at the end similar to how it is done now?
Is there any other feature in the current book rendering that we should support, too?

Background:
One of the wishes of the 2015 German-speaking community wishlist as well as the international community wishlist was the support of tables in pdfs: T135643.
It would take an enormous engineering effort to add tables to the current LaTeX layout in a way that 80%-90% of the tables display correctly. 10%-20% would always be off due to the different capabilities of the two media (a printed, laid-out page versus HTML). Therefore the idea is to offer another way of rendering PDFs. The new browser-based rendering version will not have the well-designed LaTeX layout (at least at first), but will look like the printed website: not perfect for print, but at least complete.

Related Objects

Status	Assigned	Task
Resolved	• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Resolved	None	T135643 Show tables in pdfs (#9)
Resolved	Addshore	T150185 Deploy ElectronPdfService Extension to production
Resolved	Tobi_WMDE_SW	T142201 Create a mediawiki extension for browser-based rendered pdf support
Resolved	gabriel-wmde	T142204 Investigate what is needed to use browser based rendering for books

Event Timeline

• Lea_WMDE created this task. Aug 5 2016, 1:39 PM

• Lea_WMDE moved this task from Incoming to Tables in pdfs on the TCB-Team (now WMDE-TechWish) board. Aug 8 2016, 3:01 PM

• Lea_WMDE added a project: Electron-PDFs. Aug 8 2016, 3:05 PM

This is the result of my investigation:

General Description of the approach

Books rely on the Collection extension and the Offline Content Generator (OCG) bundler. The new rendering method will use both as well, because all the necessary information is contained in the bundle, which has a documented structure.

PDF generation is done in three steps:

Download the book bundle. The bundle is in OCG bundle format.
Create a single HTML page from the bundle, with a title page, table of contents, moved references and authors list.
Send HTML to Electron renderer.

Title page

The bundle contains the file [metabook.json](https://www.mediawiki.org/wiki/Offline_content_generator/metabook.json) which contains all the necessary information for a title page: Title, Subtitle and Summary.
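
As an illustration, the title page could be assembled from the parsed metabook.json along these lines. This is only a sketch: the key names `title`, `subtitle` and `summary` are assumed to match the fields named above, and the function name and CSS class names are made up for this example.

```python
from html import escape


def build_title_page(metabook: dict) -> str:
    """Return an HTML fragment for the book's title page.

    `metabook` is the parsed content of metabook.json. The keys used
    here ("title", "subtitle", "summary") are assumptions based on the
    fields listed in the investigation.
    """
    parts = ['<section class="book-title-page">']
    # Fall back to a placeholder so the page never has an empty <h1>.
    title = metabook.get("title") or "Untitled book"
    parts.append(f"<h1>{escape(title)}</h1>")
    subtitle = metabook.get("subtitle")
    if subtitle:
        parts.append(f'<p class="book-subtitle">{escape(subtitle)}</p>')
    summary = metabook.get("summary")
    if summary:
        parts.append(f'<p class="book-summary">{escape(summary)}</p>')
    parts.append("</section>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_title_page({"title": "Planets & Moons", "subtitle": "A reader"}))
```

A CSS rule such as `page-break-after: always` on the wrapping element would then keep the title page on its own sheet when printed.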

Table of contents

In general, printing CSS-styled HTML lacks two important features: native page numbers and references to page numbers. "Native page numbers" means page numbers generated by CSS, not page numbers that might be generated by the rendering browser (Electron in this case). For more information see this StackOverflow discussion. A table of contents would therefore be purely informative: it would have clickable links in the PDF, but without page numbers it would be useless on printed pages.

Generating the table of contents from the bundle (without page numbers) can be done as follows: The metabook.json file contains the chapter/article order. This order can be used to generate a table of contents with the section headings of each article. The section headings are stored in the key-value database html.db. The format is the [JSON output of an API call to action=parse](https://en.wikipedia.org/w/api.php?action=help&modules=parse). That means the JSON contains a sections key with information about each section: unique ID, text, section number, etc. Some processing of the page HTML is needed to ensure that the IDs are really unique across the whole book and can be used as link targets, e.g. by prefixing the IDs with the article name.
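
A rough sketch of that idea follows. It assumes each article's section list has already been read from html.db and uses the `toclevel`, `number`, `line` and `anchor` fields that action=parse returns for each section. The helper names, the `book-toc` class and the way the article name is turned into a prefix are illustrative choices, not something prescribed by the bundle format.

```python
import re
from html import escape


def prefixed_anchor(article_title: str, anchor: str) -> str:
    """Make a section anchor unique across the book by prefixing it
    with a slug of the article title (hypothetical scheme)."""
    slug = re.sub(r"[^\w-]+", "_", article_title).strip("_")
    return f"{slug}-{anchor}"


def build_toc(articles) -> str:
    """Build a nested HTML list from (title, sections) pairs.

    `articles` must already be in the order given by metabook.json.
    Each section is a dict as found in the "sections" list of an
    action=parse response.
    """
    lines = ['<ul class="book-toc">']
    for title, sections in articles:
        lines.append(f"<li>{escape(title)}<ul>")
        for section in sections:
            target = prefixed_anchor(title, section["anchor"])
            # "line" is already HTML in the API output, so it is not escaped.
            lines.append(
                f'<li class="toclevel-{section["toclevel"]}">'
                f'<a href="#{target}">{escape(section["number"])} '
                f'{section["line"]}</a></li>'
            )
        lines.append("</ul></li>")
    lines.append("</ul>")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [("Venus", [{"toclevel": 1, "number": "1",
                        "line": "Physical characteristics",
                        "anchor": "Physical_characteristics"}])]
    print(build_toc(demo))
```

The same `prefixed_anchor` helper would have to be applied to the heading IDs in the page HTML so that the links resolve. Two article titles that reduce to the same slug would still collide, so a real implementation might prefer the article's position in the book as the prefix.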

Moving the references

The references section can be detected in the page HTML through the CSS selector ol.references. However, some pages (e.g. en:Venus) contain notes for certain references in a separate section, so some processing to find the right elements, as well as page- or wiki-specific handling, has to be implemented.

Again, the reference IDs are only unique within an article and have to be prefixed with the article name to be unique throughout the generated HTML.
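
The prefixing could look roughly like the sketch below. It assumes the IDs follow the `cite_note-…` / `cite_ref-…` pattern that the Cite extension typically emits and that attributes are double-quoted. A regular expression is used only to keep the example short; a real implementation would more likely work on a parsed DOM. The function name is hypothetical.

```python
import re

# Match id="cite_note-..." / id="cite_ref-..." but not e.g. data-id="...".
_ID_ATTR = re.compile(r'(?<![\w-])id="(cite_(?:note|ref)-[^"]*)"')
# Match in-page links pointing at those IDs.
_HREF_ATTR = re.compile(r'(?<![\w-])href="#(cite_(?:note|ref)-[^"]*)"')


def prefix_reference_ids(page_html: str, prefix: str) -> str:
    """Prefix citation IDs and the links that target them.

    IDs and hrefs get the same prefix, so footnote markers and
    backlinks keep pointing at each other after the rewrite.
    """
    page_html = _ID_ATTR.sub(
        lambda m: f'id="{prefix}-{m.group(1)}"', page_html)
    page_html = _HREF_ATTR.sub(
        lambda m: f'href="#{prefix}-{m.group(1)}"', page_html)
    return page_html


if __name__ == "__main__":
    sample = ('<sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>'
              '<li id="cite_note-1"><a href="#cite_ref-1">^</a> Source</li>')
    print(prefix_reference_ids(sample, "Venus"))
```

Because both the marker in the article text and the entry in the moved reference list are rewritten with the same prefix, the list can be relocated to the end of the book without breaking the links.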

Adding the authors

The OCG bundle comes with an authors.html file which contains the user names for each wiki page and each image. The file contents can be appended to the HTML.

More processing

The HTML from the bundle needs to be cleaned up a bit:

Each section header contains an "edit section" link which needs to be removed.
Some articles have their own table of contents section, which could be removed.
Some sections contain content that has a noprint class. This leads to sections with a heading and no content (e.g. "See also"). To avoid this, all elements with a noprint class must be deleted from the HTML, and then all empty section headings must be deleted from both the HTML and the table of contents data structure.
The "References" section heading also needs to be removed. This can be done by the "Remove empty sections" code because when moving the references to the end the section will be empty.
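
The cleanup steps above can be sketched as follows. To stay within the standard library, the example parses the article with `xml.etree.ElementTree`, which only works if the markup is well-formed XML; real page HTML would probably need a proper HTML parser. It also assumes that headings and their content are direct children of one wrapper element, and that edit links and headline spans carry the `mw-editsection` and `mw-headline` classes. All function names are made up for this sketch.

```python
import xml.etree.ElementTree as ET

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


def _has_class(el, name):
    return name in el.get("class", "").split()


def _remove_keep_tail(parent, child):
    """Remove `child` but keep the text that follows it in `parent`."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def _drop_by_class(parent, name):
    for child in list(parent):
        if _has_class(child, name):
            _remove_keep_tail(parent, child)
        else:
            _drop_by_class(child, name)


def _drop_empty_sections(root):
    """Remove headings with no content before the next heading of the
    same or a higher level. Returns the anchors of removed headings."""
    removed = []
    # Walk backwards so a parent heading is checked after its subsections.
    for el in reversed(list(root)):
        level = HEADING_LEVELS.get(el.tag)
        if level is None or (el.tail or "").strip():
            continue
        siblings = list(root)
        pos = siblings.index(el)
        nxt = siblings[pos + 1] if pos + 1 < len(siblings) else None
        if nxt is None or HEADING_LEVELS.get(nxt.tag, 7) <= level:
            headline = next((s for s in el.iter("span")
                             if _has_class(s, "mw-headline")), None)
            if headline is not None and headline.get("id"):
                removed.append(headline.get("id"))
            root.remove(el)
    return removed


def clean_article(xhtml: str):
    """Return (cleaned markup, anchors of the removed empty sections)."""
    root = ET.fromstring(xhtml)
    for cls in ("mw-editsection", "noprint"):
        _drop_by_class(root, cls)
    removed = _drop_empty_sections(root)
    return ET.tostring(root, encoding="unicode"), removed
```

The returned list of anchors is what would be used to drop the matching entries from the table of contents data structure. Running the reference-moving step first means the then-empty "References" heading is removed by the same pass.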

Dealing with Images

Since the book bundle contains all the images, it makes sense to use them and not make the Electron renderer re-download them. That means that the Electron renderer should be on the same machine as the rest of the processing and cannot be used as a separate service like we do with the planned extension.

• Lea_WMDE added a project: TCB-Team-Sprint-2016-08-11. Aug 19 2016, 9:39 AM

WMDE-leszek subscribed. Aug 19 2016, 9:39 AM

• Lea_WMDE moved this task from Proposed to Done on the TCB-Team-Sprint-2016-08-11 board. Aug 19 2016, 3:18 PM

WMDE-leszek triaged this task as Medium priority. Aug 25 2016, 1:11 PM

WMDE-leszek assigned this task to gabriel-wmde. Aug 25 2016, 1:16 PM

Tobi_WMDE_SW closed this task as Resolved. Aug 29 2016, 12:48 PM

Tobi_WMDE_SW moved this task from Incoming to Done on the German-Community-Wishlist board. Aug 31 2016, 2:14 PM

Tobi_WMDE_SW mentioned this in T149698: Investigate technical possibilities to render Collections using the Electron PDF service. Nov 10 2016, 2:19 PM