
Investigate underlying issues with tables in PDF rendering
Closed, Resolved | Public | 2 Estimated Story Points

Description

We need to understand why it is currently not possible to include tables when rendering PDFs.

  • How does the rendering currently work?
  • Why are tables skipped?
  • What are the table types that exist in pdfs? (some of them might be mentioned in the tasks in T73808: Support tables in PDF rendering (tracking))
  • Is this the same issue as trying to download a book in pdf format? (see [1])
  • Are there other template types that don't get supported as well? (see [1])

For investigating the situation in general, T134205: Options for browser-based server-side PDF generation might be interesting, too.

Expected outcome:

  • A general estimation of how hard it is to change something in the pdf template (i.e. what needs / should be done)

If manipulating the pdf template is a doable amount of work:

  • An estimate of the differences between table in pdf and table in book problems
  • A table with all table / template types that would have to be tackled, with the following info:
    • table / template name
    • rough estimate how often that type is used
    • rough estimate how much work it is to implement it

[1] https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Templates#Inserting_templates_.28tables.29_in_PDF_and_Books

Event Timeline

The main issue is that LaTeX doesn't handle wide tables well, and it is technically difficult to determine the width of a wikitext table.

There are basic patches for table support at https://gerrit.wikimedia.org/r/107587 -- it is possible we could use a whitelist mechanism of some sort to enable tables only when the page author can guarantee that they are narrow enough, or something like that.

A proper solution would probably use LaTeX itself in a loop to compute the width of the proposed table, and then try several different alternative layout strategies if it were "too wide". If it were wider than a single column, it would first shift to double-column mode. If it were still wider, perhaps the font size would be reduced. If it were still too wide, perhaps some columns would be removed, etc.

A similar issue applies if the table is "too tall", since LaTeX doesn't handle tables breaking across page boundaries well. A "too tall" table would probably first be floated to a page by itself. If it were still "too tall", then we could switch to one of the LaTeX packages which allow splitting the table across multiple pages, etc.
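The fallback order described above could be sketched roughly as follows. This is only an illustration in Python, not code from mw-ocg-latexer: the strategy names, the `measure_width` callback (assumed to typeset the table with LaTeX under a given strategy and return its width in points) and the height thresholds are all hypothetical.

```python
from typing import Callable, Dict, Optional

# Hypothetical strategy names, tried in the order proposed in the thread.
WIDTH_STRATEGIES = ["single-column", "double-column", "small-font", "drop-columns"]


def choose_width_strategy(
    measure_width: Callable[[str], float],
    limits: Dict[str, float],
) -> Optional[str]:
    """Return the first strategy whose measured width fits, or None.

    measure_width(strategy) is an assumed callback that typesets the table
    under that strategy and reports its width in points.
    limits maps each strategy to the width available under it.
    """
    for strategy in WIDTH_STRATEGIES:
        if measure_width(strategy) <= limits[strategy]:
            return strategy
    return None  # nothing fits; the caller has to suppress the table


def choose_height_strategy(height: float, max_inline: float, page_height: float) -> str:
    """Pick a placement for a table of the given height (all values in points)."""
    if height <= max_inline:
        return "inline"
    if height <= page_height:
        return "float-own-page"
    # Taller than a page: needs a package that can break tables across pages.
    return "split-across-pages"
```

Returning `None` when no strategy fits keeps the decision to drop the table explicit instead of handing LaTeX input that it may not be able to typeset.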

Broader issues here:

  1. The codebase should be refactored so that wiki-specific hacks are separated from general layout. The LaTeX renderer contains some enwiki-specific hacks; these should be factored out.
  2. We ought to add a general mechanism for providing printing hints -- probably specially-named classes that the author could add to the markup. "Whitelist this table", "layout this article in single-column mode", "float this table as a double-column figure", etc, could all be hints.
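
To make the printing-hints idea a bit more concrete, here is a minimal Python sketch of mapping class names to hints. The class names and hint names are invented for illustration; no naming scheme has been agreed on, and none of this exists in the renderer.

```python
from typing import Set

# Hypothetical class names an author could add to the markup,
# mapped to the hypothetical hint each one would trigger.
HINT_CLASSES = {
    "print-table-ok": "whitelist-table",
    "print-single-column": "single-column-article",
    "print-float-wide": "float-double-column",
}


def printing_hints(class_attr: str) -> Set[str]:
    """Return the printing hints found in an element's class attribute."""
    return {HINT_CLASSES[name] for name in class_attr.split() if name in HINT_CLASSES}
```

For example, `printing_hints("wikitable print-table-ok sortable")` yields only the whitelist hint; unrelated classes are ignored.
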
Tobi_WMDE_SW set the point value for this task to 21. May 19 2016, 2:28 PM
Tobi_WMDE_SW moved this task from Proposed to Backlog on the TCB-Team-Sprint-2016-05-19 board.

It appears that OCG can render tables, but no tables at all are rendered for Wikimedia?
Or are they, and I am looking in all of the wrong places?

How does the rendering currently work?

The Collection extension uses several tools to provide the functionality given on Wikipedia.

  • mw-ocg-bundler grabs and bundles all dependencies for a given set of articles and puts them into a .zip file.
  • mw-ocg-latexer takes zipped bundles from the bundler and converts the input to PDFs using LaTeX.
  • mw-ocg-service provides a server-side service that lets MediaWiki users convert articles using the above tools.

Why are tables skipped?

Because there are a lot of table formats, and converting them to nice LaTeX tables that fit the output format is non-trivial, tables are skipped in the script. See also the comments by @cscott.

Is this the same issue as trying to download a book in pdf format? (see [1])

As I understand it, yes.

Are there other template types that don't get supported as well? (see [1])

For example, all the Infobox templates.

A general estimation of how hard it is to change something in the pdf template (i.e. what needs / should be done)

Adding to or changing how the PDF is generated doesn't seem too hard. Basically it's about parsing the content from the bundles and converting it to sane and pretty LaTeX. The patch mentioned by @cscott already gives a good idea of how that could be done for tables.

An estimate of the differences between table in pdf and table in book problems

As I understand it, the problems are the same: it's all about the different table formats and especially the sizes that need to be considered.

Is it possible to include/exclude table/template types?

As I see it, the parser can analyze the DOM of the generated HTML and therefore could also parse CSS classes or other attributes present. This could be used to differentiate between certain templates and elements.

@WMDE-Fisch great, thanks for the insights! Open questions on my side:

  • Would it make sense to tackle the table/pdf issues table type by table type?

If yes, I would love to have

  • A table with all table / template types that would have to be tackled, with the following info:
    • table / template name
    • rough estimate how often that type is used
    • rough estimate how much work it is to implement it
Tobi_WMDE_SW changed the point value for this task from 21 to 2. Jun 2 2016, 2:10 PM

Changing story points to 2 for wrapping up what we've discussed, then this task can be closed.

In general, specific templates or table types are not the problem; rather, as @cscott states, the problem is tables that are too wide or too tall to display nicely. Furthermore, fitting them into the surrounding text and article structure without blowing up the whole layout is quite complicated.

As discussed, one first approach could be showing all tables in the appendix with references and links to them in the text.

https://gerrit.wikimedia.org/r/107587 is a good starting point for the basic table parsing. Some open issues should be fixed first, especially with colspan and rowspan. In addition, Infoboxes should be handled separately and could be ignored at the start. These things could be tasks for the upcoming hackathon at Wikimania.

General refactoring and code cleanup would definitely help.

WMDE-Fisch moved this task from Review to Done on the TCB-Team-Sprint-2016-06-02 board.

As I see it, the parser can analyze the DOM of the generated HTML and therefore could also parse CSS classes or other attributes present. This could be used to differentiate between certain templates and elements.

Note that there are limits to the CSS parsing of the DOM implementation used. You can access class names (those are present as attributes in the DOM) and you can do basic parsing of CSS properties directly specified in style attributes, but there is no support for parsing a full CSS stylesheet to determine which properties ought to be applied, nor is there support for calculating any computed properties (such as element size). A few more details at https://github.com/fgnass/domino/issues/50
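
A rough illustration of what is and is not available: the sketch below reads class names and inline `style` declarations from `<table>` tags and nothing more. It uses Python's standard `html.parser` as a stand-in for domino, so the class and function names here are illustrative only, and the declaration splitting is deliberately naive (it would mis-handle values that themselves contain `;`).

```python
from html.parser import HTMLParser
from typing import Dict, List


def parse_inline_style(style: str) -> Dict[str, str]:
    """Split an inline style attribute into property/value pairs (naive)."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip()
    return declarations


class TableAttrCollector(HTMLParser):
    """Collect class names and inline styles of every <table> start tag."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag != "table":
            return
        attr = dict(attrs)
        self.tables.append({
            # Attribute values can be None for valueless attributes.
            "classes": (attr.get("class") or "").split(),
            "style": parse_inline_style(attr.get("style") or ""),
        })
```

Anything that comes from a stylesheet rule, or any computed value such as the rendered width, is simply not visible at this level.
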

So it is a challenge to determine if a table is "too wide" or "too tall". You can either explicitly whitelist certain tables with an explicit class attribute, or else (better) use the LaTeX output to determine the size. Trying to rely on CSS properties for size is probably not a good idea.
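
One way to use the LaTeX output for this, sketched in Python under some assumptions: wrap the candidate table in a save box, have LaTeX print the box dimensions with `\typeout`, and read them back from the log. The `TABLESIZE` marker, the template and the function names are made up for this sketch; actually running the LaTeX engine on the generated document is left out, and a table that makes LaTeX fail would produce no marker at all.

```python
import re
from typing import Optional, Tuple

# Minimal measuring document; @@TABLE@@ is replaced with the table's LaTeX.
MEASURE_TEMPLATE = r"""\documentclass{article}
\newsavebox{\tablebox}
\begin{document}
\sbox{\tablebox}{@@TABLE@@}
\typeout{TABLESIZE width=\the\wd\tablebox\space height=\the\ht\tablebox\space depth=\the\dp\tablebox}
\end{document}
"""

SIZE_RE = re.compile(r"TABLESIZE width=([\d.]+)pt height=([\d.]+)pt depth=([\d.]+)pt")


def build_measure_document(table_tex: str) -> str:
    """Return a LaTeX document that only measures the given table."""
    return MEASURE_TEMPLATE.replace("@@TABLE@@", table_tex)


def parse_table_size(log_text: str) -> Optional[Tuple[float, float]]:
    """Return (width, total height) in points, or None if no marker was logged."""
    match = SIZE_RE.search(log_text)
    if match is None:
        return None
    width, height, depth = (float(group) for group in match.groups())
    return width, height + depth
```

The total height is height plus depth because a `tabular` is by default centred on the baseline, so part of it extends below.
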

Further, "displaying all tables in the appendix" isn't really a solution, since tables which are too wide/tall can crash LaTeX and then you get no output at all. This is the main reason I initially decided it was better to suppress tables by default.

use the LaTeX output to determine the size. Trying to rely on CSS properties for size is probably not a good idea.

Further, "displaying all tables in the appendix" isn't really a solution, since tables which are too wide/tall can crash LaTeX and then you get no output at all. This is the main reason I initially decided it was better to suppress tables by default.

@cscott Do you think it would work if we were to use the LaTeX output to determine the size of a table, and to skip the table if it is too big / wide (ideally adding a text message such as "Table too big / wide to be printed")?
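
Roughly what I have in mind, as a Python sketch. The function name, the placeholder text and the limits are hypothetical, and `size` is assumed to come from a separate measuring run that reports `None` when LaTeX could not typeset the table at all.

```python
from typing import Optional, Tuple

# Hypothetical placeholder emitted instead of a table that does not fit.
PLACEHOLDER = r"\emph{[Table too big / wide to be printed]}"


def table_or_placeholder(
    table_tex: str,
    size: Optional[Tuple[float, float]],
    max_width: float,
    max_height: float,
) -> str:
    """Return the table's LaTeX if it fits, otherwise the placeholder.

    size is (width, height) in points, or None if measuring failed.
    """
    if size is None:
        return PLACEHOLDER  # measuring failed, so do not risk the real run
    width, height = size
    if width > max_width or height > max_height:
        return PLACEHOLDER
    return table_tex
```

Treating a failed measurement the same as "too big" means a table that crashes the measuring run never reaches the main document.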