T135616 Investigate underlying issues with tables in PDF rendering

Status	Assigned	Task
Resolved	• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Resolved	None	T135643 Show tables in pdfs (#9)
Resolved	WMDE-Fisch	T135616 Investigate underlying issues with tables in PDF rendering

Tobi_WMDE_SW created this task. May 18 2016, 12:01 PM

Restricted Application added subscribers: Zppix, Aklapper. May 18 2016, 12:01 PM

Lea_WMDE added a parent task: T135643: Show tables in pdfs (#9). May 18 2016, 4:00 PM

Lea_WMDE removed a parent task: T73808: Support tables in PDF rendering (tracking). May 18 2016, 4:02 PM

Lea_WMDE updated the task description. May 18 2016, 4:22 PM

Lea_WMDE moved this task from Incoming to Tables in pdfs on the TCB-Team (now WMDE-TechWish) board. May 18 2016, 4:36 PM

The main issue is that LaTeX doesn't handle wide tables well, and it is technically difficult to determine the width of a wikitext table.

There are basic patches for table support at https://gerrit.wikimedia.org/r/107587 -- it is possible we could use a whitelist mechanism of some sort to enable tables only when the page author can guarantee that they are narrow enough, or something like that.

A proper solution would probably use LaTeX itself in a loop to compute the width of the proposed table, and then try several different alternative layout strategies if it were "too wide". If it were wider than a single column, it would first shift to double-column mode. If it were still wider, perhaps the font size would be reduced. If it were still too wide, perhaps some columns would be removed, etc.
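The fallback loop described above could be sketched roughly as follows. This is only an illustration in Python, not code from mw-ocg-latexer: the strategy names and the two callables (`measure_width`, which would have to run LaTeX on the proposed table, and `available_width`) are hypothetical.

```python
from typing import Callable, Optional, Sequence

# Hypothetical strategy names, in the order the comment above suggests trying them.
STRATEGIES: Sequence[str] = ("column", "page-wide", "reduced-font", "fewer-columns")


def choose_layout(
    measure_width: Callable[[str], float],
    available_width: Callable[[str], float],
) -> Optional[str]:
    """Return the first strategy under which the table fits, or None if none does.

    measure_width(strategy) is assumed to typeset the table with that strategy
    and report its width in points; available_width(strategy) is the space the
    layout offers for that strategy.
    """
    for strategy in STRATEGIES:
        if measure_width(strategy) <= available_width(strategy):
            return strategy
    return None
```

Returning `None` leaves the decision about what to do with a table that never fits (suppress it, print a placeholder) to the caller.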

A similar issue applies if the table is "too tall", since LaTeX doesn't handle tables breaking across page boundaries well. A "too tall" table would probably first be floated to a page by itself. If it were still "too tall", we could switch to one of the LaTeX packages that allow splitting a table across multiple pages, etc.
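The same idea for height, as a minimal sketch. The function name and the placement labels are made up for illustration; `longtable` is named only as one example of a LaTeX package that can split a table across pages, not as the package the renderer would actually use.

```python
def choose_vertical_placement(
    table_height_pt: float,
    remaining_on_page_pt: float,
    page_height_pt: float,
) -> str:
    """Pick a placement for a table based on its measured height in points."""
    if table_height_pt <= remaining_on_page_pt:
        # Fits where it is: no special handling needed.
        return "inline"
    if table_height_pt <= page_height_pt:
        # Fits on one page: float it to a page of its own,
        # e.g. a table float with the [p] placement specifier.
        return "float-page"
    # Taller than a page: needs an environment that can break across pages,
    # e.g. the one provided by the longtable package.
    return "split-across-pages"
```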

Broader issues here:

The codebase should be refactored so that wiki-specific hacks are separated from general layout. The LaTeX renderer contains some enwiki-specific hacks; these should be factored out.
We ought to add a general mechanism for providing printing hints -- probably specially-named classes that the author could add to the markup. "Whitelist this table", "layout this article in single-column mode", "float this table as a double-column figure", etc, could all be hints.
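A printing-hint mechanism based on class names might look something like this. The `print-hint-` prefix and the hint names are invented for the example; no such convention is defined in this task.

```python
from typing import Set

# Hypothetical prefix marking a class name as a printing hint.
PRINT_HINT_PREFIX = "print-hint-"


def extract_print_hints(class_attr: str) -> Set[str]:
    """Return the printing hints found in an element's class attribute."""
    return {
        name[len(PRINT_HINT_PREFIX):]
        for name in class_attr.split()
        if name.startswith(PRINT_HINT_PREFIX)
    }
```

With this convention, a table marked up with `class="wikitable print-hint-whitelist"` would yield the hint `whitelist`, and ordinary styling classes would be ignored.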

Lea_WMDE updated the task description. May 19 2016, 10:31 AM

Tobi_WMDE_SW set the point value for this task to 21. May 19 2016, 2:28 PM

Tobi_WMDE_SW moved this task from Proposed to Backlog on the TCB-Team-Sprint-2016-05-19 board.

It would appear that OCG can render tables, but no tables at all are rendered for Wikimedia?
Or are they, and I am looking in all the wrong places?

How does the rendering currently work?

The Collection extension uses several tools to provide the functionality given on Wikipedia.

mw-ocg-bundler grabs and bundles all dependencies for a given set of articles and puts them into a .zip file.
mw-ocg-latexer takes zipped bundles from the bundler and converts the input to PDFs using LaTeX.
mw-ocg-service provides a server-side service that allows MediaWiki users to convert articles using the above tools.

Why are tables skipped?

Because there are many table formats, and converting them into nice LaTeX tables that fit the output format is non-trivial, tables are skipped in the script. See also the comments by @cscott.

Is this the same issue as trying to download a book in pdf format? (see [1])

As I understand it, yes.

Are there other template types that aren't supported either? (see [1])

e.g. all the Infobox templates

A general estimation of how hard it is to change something in the pdf template (i.e. what needs / should be done)

Adding to or changing how the PDF is generated doesn't seem too hard. Basically, it's about parsing the content from the bundles and converting it to sane and pretty LaTeX. The patch mentioned by @cscott already gives a good idea of how that could be done for tables.

An estimate of the differences between table in pdf and table in book problems

As I understand it, the problems are the same, and it's all about the different table formats and especially the sizes that need to be considered.

Is it possible to include/exclude table/template types?

As I see it, the parser can analyze the DOM of the generated HTML and therefore could also parse CSS classes or other attributes present. This could be used to differentiate between certain templates and elements.
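To illustrate the idea of telling elements apart by their classes, here is a small sketch using Python's standard `html.parser`. It is a stand-in for illustration only, not the DOM implementation the renderer uses, and treating the `infobox` class as the marker for infobox templates is an assumption.

```python
from html.parser import HTMLParser
from typing import List


class TableClassCollector(HTMLParser):
    """Collect the class names of every <table> start tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[List[str]] = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # attrs is a list of (name, value) pairs; value may be None.
            classes = (dict(attrs).get("class") or "").split()
            self.tables.append(classes)


def is_infobox(classes: List[str]) -> bool:
    """Assumption: infobox templates render a table carrying the class 'infobox'."""
    return "infobox" in classes
```

Feeding the generated HTML to `TableClassCollector().feed(html)` and then filtering `tables` with `is_infobox` would be enough to include or exclude a table type.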

WMDE-Fisch claimed this task. Jun 1 2016, 10:08 AM

WMDE-Fisch moved this task from Backlog to Review on the TCB-Team-Sprint-2016-05-19 board.

@WMDE-Fisch great, thanks for the insights! Open questions on my side:

Would it make sense to tackle the table/pdf issues table type by table type?

If yes, I would love to have

A table with all table / template types that would have to be tackled, with the following info:
- table / template name
- rough estimate how often that type is used
- rough estimate how much work it is to implement it

Tobi_WMDE_SW added a project: TCB-Team-Sprint-2016-06-02. Jun 2 2016, 1:48 PM

Tobi_WMDE_SW moved this task from Proposed to Review on the TCB-Team-Sprint-2016-06-02 board.

Changing story points to 2 for wrapping up what we've discussed, then this task can be closed.

In general, specific templates or table types are not the problem. The problem, as @cscott states, is tables that are too wide or too tall to display nicely. Furthermore, fitting them into the surrounding text and article structure without breaking the whole layout is quite complicated.

As discussed, one first approach could be showing all tables in the appendix with references and links to them in the text.

https://gerrit.wikimedia.org/r/107587 is a good starting point for the basic table parsing. Some open issues should be fixed first, especially with colspan and rowspan. In addition, infoboxes should be handled separately and could be ignored at first. These things could be tasks for the upcoming hackathon at Wikimania.
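For context on the colspan part, a minimal sketch of how a spanning cell can be expressed in LaTeX with `\multicolumn`. This is not taken from the Gerrit patch; the function names are made up, cell text is assumed to be LaTeX-escaped already, and rowspan (which needs something like the multirow package) is not handled.

```python
from typing import Iterable, Tuple


def latex_cell(text: str, colspan: int = 1, align: str = "l") -> str:
    """Render one cell; cells spanning several columns use \\multicolumn."""
    if colspan > 1:
        return r"\multicolumn{%d}{%s}{%s}" % (colspan, align, text)
    return text


def latex_row(cells: Iterable[Tuple[str, int]]) -> str:
    """Render a row from (text, colspan) pairs, ending with the row terminator."""
    return " & ".join(latex_cell(text, span) for text, span in cells) + r" \\"
```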

General refactoring and code cleanup would definitely help.

WMDE-Fisch closed this task as Resolved. Jun 2 2016, 3:30 PM

WMDE-Fisch moved this task from Review to Done on the TCB-Team-Sprint-2016-06-02 board.

In T135616#2344343, @WMDE-Fisch wrote:

As I see it, the parser can analyze the DOM of the generated HTML and therefore could also parse CSS classes or other attributes present. This could be used to differentiate between certain templates and elements.

Note that there are limits to the CSS parsing of the DOM implementation used. You can access class names (those are present as attributes in the DOM) and you can do basic parsing of CSS properties directly specified in style attributes, but there is no support for parsing a full CSS stylesheet to determine which properties ought to be applied, nor is there support for calculating any computed properties (such as element size). A few more details at https://github.com/fgnass/domino/issues/50
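As an illustration of what "basic parsing of CSS properties directly specified in style attributes" amounts to, here is a naive sketch in Python (the function name is made up). It only splits the declarations of one inline `style` value; like the limitation described above, it knows nothing about stylesheets, the cascade, or computed sizes.

```python
from typing import Dict


def parse_inline_style(style: str) -> Dict[str, str]:
    """Split an inline style attribute into a {property: value} mapping.

    Naive on purpose: a semicolon inside a quoted string or url(...) would
    be split incorrectly.
    """
    props: Dict[str, str] = {}
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip():
            props[name.strip().lower()] = value.strip()
    return props
```

A value such as `width: 80%` is returned as the string `80%`; turning that into an actual size would require layout information that is not available at this stage.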

So it is a challenge to determine if a table is "too wide" or "too tall". You can either explicitly whitelist certain tables with an explicit class attribute, or else (better) use the LaTeX output to determine the size. Trying to rely on CSS properties for size is probably not a good idea.
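One way "use the LaTeX output to determine the size" could work is to typeset the table into a box, write the box dimensions to the log, and read them back. The sketch below only builds the measurement document and parses a log line; actually running a TeX engine is left out. The `TABLESIZE` marker and the function names are hypothetical, and `\sbox` reports the natural size of the tabular, so this is a rough approach rather than a finished one.

```python
import re
from typing import Optional, Tuple

MARKER = "TABLESIZE"  # hypothetical marker used to find our line in the log

_SIZE_RE = re.compile(MARKER + r" (-?[\d.]+)pt (-?[\d.]+)pt (-?[\d.]+)pt")


def measurement_document(tabular_source: str) -> str:
    """Wrap a tabular in a document that prints its width, height and depth."""
    return "\n".join([
        r"\documentclass{article}",
        r"\newsavebox{\tablebox}",
        r"\begin{document}",
        r"\sbox{\tablebox}{" + tabular_source + "}",
        r"\typeout{" + MARKER
        + r" \the\wd\tablebox\space\the\ht\tablebox\space\the\dp\tablebox}",
        r"\end{document}",
    ])


def parse_measurement(log_text: str) -> Optional[Tuple[float, float]]:
    """Return (width, total height) in points, or None if the marker is missing."""
    match = _SIZE_RE.search(log_text)
    if match is None:
        return None
    width, height, depth = (float(group) for group in match.groups())
    return width, height + depth
```

If LaTeX fails on the table, the marker never appears and `parse_measurement` returns `None`, which the caller can treat as "do not render this table".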

Further, "displaying all tables in the appendix" isn't really a solution, since tables which are too wide/tall can crash LaTeX and then you get no output at all. This is the main reason I initially decided it was better to suppress tables by default.

use the LaTeX output to determine the size. Trying to rely on CSS properties for size is probably not a good idea.

Further, "displaying all tables in the appendix" isn't really a solution, since tables which are too wide/tall can crash LaTeX and then you get no output at all. This is the main reason I initially decided it was better to suppress tables by default.

@cscott Do you think it would work if we used the LaTeX output to determine the size of a table and skipped the table if it was too big / wide (ideally adding a text message such as "Table too big / wide to be printed")?
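The proposal in this question could be sketched like this, assuming a measured size is already available from some earlier step. The function name, the limits, and the exact placeholder markup are hypothetical.

```python
from typing import Optional, Tuple

# Placeholder text taken from the question above; the \emph markup is an assumption.
PLACEHOLDER = r"\emph{Table too big / wide to be printed}"


def render_table_or_placeholder(
    tabular_source: str,
    size_pt: Optional[Tuple[float, float]],
    max_width_pt: float,
    max_height_pt: float,
) -> str:
    """Return the table's LaTeX if it fits, otherwise a short placeholder.

    size_pt is (width, height) in points, or None when the table could not
    be measured at all; an unmeasurable table is skipped as well.
    """
    if size_pt is None:
        return PLACEHOLDER
    width, height = size_pt
    if width > max_width_pt or height > max_height_pt:
        return PLACEHOLDER
    return tabular_source
```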

WMDE-leszek subscribed. Jun 20 2016, 9:51 AM

Addshore moved this task from Incoming to Done on the German-Community-Wishlist board. Jul 7 2016, 11:00 AM

Investigate underlying issues with tables in PDF rendering
Closed, Resolved · Public · 2 Estimated Story Points