
Investigate what is needed to use browser based rendering for books
Closed, Resolved (Public)

Description

Task:
Currently, books printed as PDF contain not only a collection of articles but also a dedicated title page, a table of contents, and, at the end of the collection, the citations and authors for all articles.
When using a browser-based rendering service like Electron, what would we need to do to still support these features? This includes:

  • To what extent would it be possible to add a title page and a table of contents similar to the way they are added now?
  • How can we remove the references part of the individual articles?
  • Is it possible to add the references of all articles at the end, similarly to how it is now done for the LaTeX PDFs?
  • Is it possible to add the authors at the end, similarly to how it is done now?
  • Is there any other feature in the current book rendering that we should support, too?

Background:
One of the wishes on both the 2015 German-speaking community wishlist and the international community wishlist was support for tables in PDFs: T135643.
It would take an enormous engineering effort to add tables to the current LaTeX layout in a way that displays 80%-90% of the tables correctly. The remaining 10%-20% would always be off because the two media (a printed, laid-out page versus HTML) have different capabilities. The idea is therefore to offer another way of rendering PDFs. The new browser-based rendering will not have the well-designed LaTeX layout (at least initially) but will look like the printed website: not perfect for print, but at least complete.

Event Timeline

Lea_WMDE moved this task from Incoming to Tables in pdfs on the TCB-Team board. Aug 8 2016, 3:01 PM

This is the result of my investigation:

General Description of the approach

Books rely on the Collection extension and the Offline Content Generator (OCG) bundler. The new rendering methods will use that as well because all the necessary information is contained in the bundle and it has a documented structure.

PDF generation is done in three steps:

  1. Download the book bundle. The bundle is in OCG bundle format.
  2. Create a single HTML page from the bundle, with a title page, table of contents, moved references and authors list.
  3. Send HTML to Electron renderer.

Title page

The bundle contains the file metabook.json which contains all the necessary information for a title page: Title, Subtitle and Summary.
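As a rough illustration of the title page step, the following minimal sketch renders the three fields as HTML. The key names "title", "subtitle" and "summary", the function name and the CSS class are assumptions for illustration; the actual layout of metabook.json should be checked against a real bundle.

```python
import html
import json


def build_title_page(metabook: dict) -> str:
    """Render a minimal title page from the metabook fields.

    Assumption: the fields are stored under the keys "title",
    "subtitle" and "summary". Empty fields are skipped.
    """
    parts = ['<section class="book-title-page">']
    title = metabook.get("title") or "Untitled book"
    parts.append(f"<h1>{html.escape(title)}</h1>")
    subtitle = metabook.get("subtitle")
    if subtitle:
        parts.append(f"<h2>{html.escape(subtitle)}</h2>")
    summary = metabook.get("summary")
    if summary:
        parts.append(f"<p>{html.escape(summary)}</p>")
    parts.append("</section>")
    return "\n".join(parts)


# Example input standing in for the parsed contents of metabook.json.
metabook = json.loads('{"title": "Planets", "subtitle": "Inner & outer", "summary": ""}')
print(build_title_page(metabook))
```

All values are escaped before they are written into the page, since the title and summary are user-supplied text.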

Table of contents

In general, printing CSS-styled HTML lacks two important features: native page numbers and references to page numbers. "Native page numbers" means page numbers generated by CSS, not page numbers that might be generated by the rendering browser (Electron in this case). For more information see this StackOverflow discussion. A table of contents would therefore be purely informative: it would have clickable links in the PDF, but without page numbers it would be useless on printed pages.

Generating the table of contents from the bundle (without page numbers) can be done as follows: The metabook.json file contains the chapter/article order. This order can be used to generate a table of contents with the section headings of each article. The section headings are stored in the key-value database html.db, in the format of the JSON output of an API call to action=parse. That means the JSON contains a sections key with information about each section: unique ID, text, section number, etc. Some processing of the page HTML is needed to ensure that the IDs are really unique and can be used as link targets, e.g. by prepending the article name to each ID.
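A minimal sketch of this step could look as follows. It assumes each article has already been loaded into a dict with a "title" and the "sections" list from action=parse, and that each section carries the fields "anchor", "line", "number" and "toclevel". The function names, the "top" anchor and the "__" separator are made up for illustration.

```python
import html
import re


def make_anchor(article: str, anchor: str) -> str:
    """Prefix a section anchor with the article name so it is unique book-wide.

    Simplification: non-ASCII characters are collapsed to "_", so two
    different titles could map to the same prefix.
    """
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", article).strip("_")
    return f"{slug}__{anchor}"


def build_toc(articles: list[dict]) -> str:
    """Build an HTML table of contents (without page numbers).

    `articles` must already be in the chapter/article order of the book.
    """
    lines = ['<ul class="book-toc">']
    for article in articles:
        title = article["title"]
        top = make_anchor(title, "top")
        lines.append(f'<li><a href="#{top}">{html.escape(title)}</a><ul>')
        for section in article["sections"]:
            target = make_anchor(title, section["anchor"])
            # Simplification: the heading text is treated as plain text.
            label = html.escape(f'{section["number"]} {section["line"]}')
            lines.append(
                f'<li class="toclevel-{section["toclevel"]}">'
                f'<a href="#{target}">{label}</a></li>'
            )
        lines.append("</ul></li>")
    lines.append("</ul>")
    return "\n".join(lines)


book = [
    {
        "title": "Venus",
        "sections": [
            {"anchor": "Atmosphere", "line": "Atmosphere", "number": "1", "toclevel": 1},
            {"anchor": "Composition", "line": "Composition", "number": "1.1", "toclevel": 2},
        ],
    }
]
print(build_toc(book))
```

The links only work if the same prefix is applied to the heading IDs in the article HTML, so both steps should share one helper such as make_anchor.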

Moving the references

The references section can be detected in the page HTML through the CSS selector ol.references. However, some pages (e.g. en:Venus) contain notes for certain references in a separate section, so some processing to find the right element, plus page- or wiki-specific handling, has to be implemented.

Again, the reference IDs are only unique within an article and have to be prefixed with the article name to be unique throughout the generated HTML.
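The prefixing could be sketched roughly like this. It is a regex-based simplification that only handles double-quoted attributes; a real implementation would more likely use an HTML parser. The function name and the "Venus__" prefix are made up for illustration.

```python
import re

# Matches id="..." attributes and same-page links (href="#...").
# Assumption: attributes are double-quoted and preceded by whitespace.
ID_OR_FRAGMENT = re.compile(r'(\s(?:id="|href="#))([^"]+)"')


def prefix_ids(page_html: str, prefix: str) -> str:
    """Prefix every element ID and every same-page link target with `prefix`.

    External links (href="https://...") do not match and stay untouched.
    """
    def add_prefix(match: re.Match) -> str:
        return f'{match.group(1)}{prefix}{match.group(2)}"'

    return ID_OR_FRAGMENT.sub(add_prefix, page_html)


article_html = (
    '<p>Hot.<sup id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup></p>'
    '<ol class="references"><li id="cite_note-1">'
    '<a href="#cite_ref-1">^</a> <a href="https://example.org/">Source</a>'
    "</li></ol>"
)
print(prefix_ids(article_html, "Venus__"))
```

Because IDs and link targets are rewritten in the same pass, the back-links between a footnote marker and its entry keep working after the reference list has been moved to the end of the book.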

Adding the authors

The OCG bundle comes with an authors.html file which contains the user names for each wiki page and each image. The file contents can be appended to the HTML.

More processing

The HTML from the bundle needs to be cleaned up a bit:

  • Each section header contains an "edit section" link which needs to be removed.
  • Some articles have their own table of contents section, which might be removed.
  • Some sections contain content that has a noprint class. This leads to sections with a heading and no content (e.g. "See also"). To avoid this, all elements with a noprint class must be deleted from the HTML, and then all empty section headings must be deleted from both the HTML and the table of contents data structure.
  • The "References" section heading also needs to be removed. This can be done by the "Remove empty sections" code because when moving the references to the end the section will be empty.
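The cleanup steps above could be sketched as follows. To stay within the standard library, this sketch parses the page with xml.etree.ElementTree, so it assumes well-formed markup with headings as direct children of one container element; real page HTML would need a proper HTML parser. The class name mw-editsection for the "edit section" link and all function names are assumptions.

```python
import xml.etree.ElementTree as ET

HEADINGS = {"h2", "h3", "h4", "h5", "h6"}


def remove_keep_tail(parent, child):
    """Detach `child` but keep the text that followed it in the document."""
    siblings = list(parent)
    index = siblings.index(child)
    if child.tail:
        if index > 0:
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + child.tail
        else:
            parent.text = (parent.text or "") + child.tail
    parent.remove(child)


def strip_classes(root, class_names):
    """Remove every descendant that carries one of the given CSS classes."""
    for parent in list(root.iter()):
        for child in list(parent):
            if class_names & set(child.get("class", "").split()):
                remove_keep_tail(parent, child)


def drop_empty_sections(root):
    """Remove headings without content; return their IDs for the TOC update."""
    removed = []
    children = list(root)
    # Walk backwards so an emptied subsection is gone before its parent is checked.
    for i in range(len(children) - 1, -1, -1):
        el = children[i]
        if el.tag not in HEADINGS or (el.tail or "").strip():
            continue
        nxt = children[i + 1] if i + 1 < len(children) else None
        # Empty: nothing follows, or a heading of the same or a higher level follows.
        if nxt is None or (nxt.tag in HEADINGS and nxt.tag <= el.tag):
            anchor = next((d.get("id") for d in el.iter() if d.get("id")), None)
            removed.append(anchor)
            root.remove(el)
            del children[i]
    return removed


doc = ET.fromstring(
    "<div>"
    '<h2><span id="History">History</span><span class="mw-editsection">[edit]</span></h2>'
    "<p>Text.</p>"
    '<h2><span id="See_also">See also</span></h2>'
    '<ul class="noprint"><li>Link</li></ul>'
    '<h2><span id="References">References</span></h2>'
    "</div>"
)
strip_classes(doc, {"mw-editsection", "noprint"})
removed_ids = drop_empty_sections(doc)
print(removed_ids)
print(ET.tostring(doc, encoding="unicode"))
```

The returned IDs can be used to delete the matching entries from the table of contents data, so the HTML and the table of contents stay in sync.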

Dealing with Images

Since the book bundle contains all the images, it makes sense to use them and not make the Electron renderer re-download them. That means that the Electron renderer should be on the same machine as the rest of the processing and cannot be used as a separate service like we do with the planned extension.
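One way to make the renderer use the bundled images is to rewrite the image sources to local file URIs before the HTML is sent to Electron. The sketch below assumes a lookup table from original image URL to local file path has already been built from the bundle; how the bundle stores that mapping is not covered here, and the function name, URL and paths are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: src attributes are double-quoted. srcset attributes are
# not handled in this sketch.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]+)"')


def localize_images(page_html: str, local_files: dict[str, Path]) -> str:
    """Point <img> tags at files from the bundle instead of remote URLs.

    Images without an entry in `local_files` keep their original URL.
    """
    def swap(match: re.Match) -> str:
        path = local_files.get(match.group(2))
        if path is None:
            return match.group(0)
        return f'{match.group(1)}{path.resolve().as_uri()}"'

    return IMG_SRC.sub(swap, page_html)


# Hypothetical mapping from original URL to a file unpacked from the bundle.
mapping = {"//upload.example.org/venus.jpg": Path("/tmp/bundle/images/venus.jpg")}
page = '<img alt="Venus" src="//upload.example.org/venus.jpg" width="220">'
print(localize_images(page, mapping))
```

File URIs only resolve if the renderer runs on the machine that holds the unpacked bundle, which is the constraint described above.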

WMDE-leszek triaged this task as Normal priority. Aug 25 2016, 1:11 PM
Tobi_WMDE_SW closed this task as Resolved. Aug 29 2016, 12:48 PM