
[Spike 6hrs] Investigate ability of wkhtmltopdf to render single articles
Closed, ResolvedPublic

Assigned To
Authored By
ovasileva
Jun 6 2017, 11:25 AM
Referenced Files
F8463429: Trigonometric_functions.zip
Jun 15 2017, 3:59 PM
F8455109: سانتياغو.pdf
Jun 13 2017, 8:15 PM
F8455023: Trigonometric_functions.pdf
Jun 13 2017, 8:15 PM
F8454686: Berlin.pdf
Jun 13 2017, 8:15 PM
F8454816: Сантьяго — Википедия.pdf
Jun 13 2017, 8:15 PM
F8454929: Trigonometric_functions.pdf
Jun 13 2017, 8:15 PM
F8454151: Berlin.pdf
Jun 13 2017, 8:15 PM
F8454154: BerlinCover.html
Jun 13 2017, 8:15 PM

Description

We would like to evaluate the ability of wkhtmltopdf to render single-page articles. Specifically, we want to answer the following questions:

  • How does it perform when rendering tables?
  • Can it provide page numbers?
  • Can it provide support for blue links/links to other articles?
  • Are there other noted edge cases where wkhtmltopdf breaks?
  • Is there support for a two-column layout?
  • Can it use the same (core) CSS file for print styles?

Also, as a result of the spike, provide rendered articles.

Note:
Here's a list of test pages that use various templates/tables:
https://en.wikipedia.org/wiki/Berlin
https://en.wikipedia.org/wiki/Trigonometric_functions
https://en.wikipedia.org/wiki/Climate_of_Australia
https://en.wikipedia.org/wiki/Santiago
https://zh.wikipedia.org/wiki/%E5%9C%A3%E5%9C%B0%E4%BA%9A%E5%93%A5_(%E6%99%BA%E5%88%A9)
https://ar.wikipedia.org/wiki/%D8%B3%D8%A7%D9%86%D8%AA%D9%8A%D8%A7%D8%BA%D9%88
https://ru.wikipedia.org/wiki/%D0%A1%D0%B0%D0%BD%D1%82%D1%8C%D1%8F%D0%B3%D0%BE

Event Timeline

Probably worth including an RTL wiki and some complex non-Western script (e.g. Chinese) as well.

To some extent this depends on the print styles task (the difference in browser support for various print-related CSS options might be relevant).

@Tgr - good call, added a few different scripts. Feel free to add other edge cases if you think of any.

ovasileva renamed this task from [Spike] Investigate ability ot wkhtmltopdf to render single articles to [Spike 6hrs] Investigate ability ot wkhtmltopdf to render single articles. Jun 6 2017, 4:25 PM
Johan renamed this task from [Spike 6hrs] Investigate ability ot wkhtmltopdf to render single articles to [Spike 6hrs] Investigate ability of wkhtmltopdf to render single articles. Jun 7 2017, 2:09 PM

@cscott any idea what else to test for? You mentioned indic script support in the past, do you have a test case at hand?

https://phabricator.wikimedia.org/T30206#327706 has some test cases. You probably want to recruit a native speaker, however: many of the ligature, character shape, and word-breaking issues are very hard to see if you can't read the script.

Results

How does it perform when rendering tables?

Tables are being rendered. There is a slight problem with presentation when tables have to be split up into multiple pages. See the Berlin article infobox for example. Notice how the words "Elevation" and "Population" on page 3 are close to one another.

Can it provide page numbers?

Yes. It can output page numbers on the left, right, or middle of the header or footer. However, it cannot alternate positions, for example by putting even page numbers on the left and odd page numbers on the right. Page numbers should therefore be output in the middle so that they look right when the PDF is printed double-sided. Otherwise the page number always lands on the same side, left or right, regardless of which side of the sheet it is printed on.
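As a rough sketch of the centred-footer suggestion, the command line could be assembled like this. It mirrors the Berlin command used in this task but swaps `--footer-right` for wkhtmltopdf's `--footer-center` option. The helper name `build_wkhtmltopdf_args` is hypothetical and only illustrates the argument order.

```python
import shlex


def build_wkhtmltopdf_args(html_path, pdf_path):
    """Return an argv list that puts '[page] / [topage]' in the footer centre.

    Hypothetical helper: same argument order as the command used in this
    task, with --footer-center instead of --footer-right.
    """
    return [
        "wkhtmltopdf",
        "--print-media-type",
        "--footer-center", "[page] / [topage]",
        "toc",
        "page", html_path,
        pdf_path,
    ]


# Print the command in a shell-safe form for inspection.
print(shlex.join(build_wkhtmltopdf_args("Berlin.html", "Berlin.pdf")))
```

The list can be passed to `subprocess.run(args, check=True)` on a machine where wkhtmltopdf is installed.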

Can it provide support for blue links/links to other articles?

Yes. Links to other articles, other sections, and external sites all work.

Are there other noted edge cases where wkhtmltopdf breaks?

Yes.

Occasionally, some pages are not laid out correctly. For example, in the Berlin PDF, on page 16, you'll see that the text overlaps images at the top.

Another problem can be seen on pages 19 and 20. The last line of page 19 is split in two, and the bottom part of the line appears on page 20.

On Trigonometric_functions, the first formula is missing on page 3. Animated content on page 8 doesn't look good.

Is there support for a two-column layout?

No. A workaround for single-page articles exists, but it is of no use to us.

Other notes

  • The following command has been used to generate the Berlin PDF:
wkhtmltopdf --print-media-type --footer-right '[page] / [topage]' toc page Berlin.html Berlin.pdf

Berlin.html has been retrieved from https://en.wikipedia.org/api/rest_v1/page/html/Berlin and saved locally. Links have been modified to include the protocol, i.e. http.
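The link modification could be scripted along these lines. This is a minimal sketch that assumes the saved HTML uses protocol-relative URLs (values starting with `//`) in `href` and `src` attributes; the comment above does not say exactly which links were changed, and the function name `add_protocol` is hypothetical.

```python
import re

# Matches href/src attributes whose value starts with "//" (protocol-relative).
# Assumption: these are the links that needed a protocol added.
_PROTOCOL_RELATIVE = re.compile(r"""\b(href|src)=(["'])//""")


def add_protocol(html, scheme="http"):
    """Prefix protocol-relative href/src values with the given scheme."""
    return _PROTOCOL_RELATIVE.sub(
        lambda m: f"{m.group(1)}={m.group(2)}{scheme}://", html
    )
```

Attributes such as `srcset` are not handled here; a real HTML parser would be a safer choice for anything beyond a quick experiment.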

PDFs

How does it perform when rendering tables?

Tables are being rendered. There is a slight problem with presentation when tables have to be split up into multiple pages. See the Berlin article infobox for example. Notice how the words "Elevation" and "Population" on page 3 are close to one another.

Are there other noted edge cases where wkhtmltopdf breaks?

Yes.

Occasionally, some pages are not laid out correctly. For example, in the Berlin PDF, on page 16, you'll see that the text overlaps images at the top.

Another problem can be seen on pages 19 and 20. The last line of page 19 is split in two, and the bottom part of the line appears on page 20.

On Trigonometric_functions, the first formula is missing on page 3. Animated content on page 8 doesn't look good.

@bmansurov: Do you know to what extent we can fix any of the above by providing our own styles?

@bmansurov - here's some more notes from my side:

  • Article title is missing - can we add article title?
  • For single-page articles, TOC is not necessary - can we remove this?
  • On page 4 of the Berlin article, the text is overlapping the image - do we have a workaround for this?
  • Can we impose page breaks based on heading type and location on the page? For example, in the print for Santiago, the section “Физико-географическая характеристика”.
  • On the trigonometric functions article, do we know why the first formula is missing?
  • On the trigonometric functions article, what is happening between pages 17 and 19? (blank space, strange page breaks, text cut off)
  • The TOC in RTL languages is appearing in LTR order - would it be possible to adjust this?
  • On page 3 of the سانتياغو. article, a scroll bar is appearing - can we remove these and display the entirety of the infobox content?

@ovasileva is the index of articles (for book creator) left out here on purpose, so that we can add it as a requirement in another card related to book creator?

everything else looks good. I will add a minor comment in description on using same css file.

  • Can it use the same (core) CSS file for print styles?

It's already using the CSS file from core. What seems to be different?

nothing is different, that just means you solved a bullet point :P

Article title is missing - can we add article title?

Yes. Here is one:

Command used: wkhtmltopdf --print-media-type --footer-right '[page] / [topage]' cover BerlinCover.html toc page Berlin.html Berlin.pdf
Cover page HTML:

For single-page articles, TOC is not necessary - can we remove this?

This functionality is not built-in. We can either output the table of contents or not.

On page 4 of the Berlin article, the text is overlapping the image - do we have a workaround for this?

Our current print styles seem to be causing the problem. I haven't identified the root cause, but I suppose it's not too hard to find. Here is the same article without the print styles (notice how the problem is gone):

Can we impose page breaks based on heading type and location on the page? For example, in the print for Santiago, the section “Физико-географическая характеристика”.

CSS has a way to not put page breaks after certain elements, like so:

h2 {page-break-after: avoid; }

The problem is that it doesn't work reliably. Even on the mobile site (where we have this rule) you can see the same problem on page 4 at "Название города":

Alternatively, we can put page breaks before h2s and the output will look something like this:

On the trigonometric functions article, do we know why the first formula is missing?

Yes. The problem seems to be with this particular formula. When I insert the same formula into the next row, I see the same problem:

And when I insert the second formula into the first row, both rows work fine:

I also swapped their places to see if position was the problem. It turns out it is not:

So my guess is that the generated SVG is either malformed or contains something the rendering engine doesn't understand.

On the trigonometric functions article, what is happening between pages 17 and 19? (blank space, strange page breaks, text cut off)

The following inline style rules of Notes reflist seem to be causing the problem:

-moz-column-width: 30em; -webkit-column-width: 30em; column-width: 30em;

Removing these produces the following PDF:
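Stripping those declarations before rendering could look roughly like the following. This is a sketch only: it removes the three `column-width` variants quoted above with a regular expression applied to the whole HTML string, so it does not distinguish inline `style` attributes from other places where the same text might occur. The function name `strip_column_width` is hypothetical.

```python
import re

# Matches "column-width: <value>;" with an optional -moz- or -webkit- prefix.
_COLUMN_WIDTH = re.compile(
    r"(?:-moz-|-webkit-)?column-width\s*:[^;\"']*;?\s*",
    re.IGNORECASE,
)


def strip_column_width(html):
    """Remove column-width declarations (all three variants) from the HTML."""
    return _COLUMN_WIDTH.sub("", html)
```

Overriding the rule from the print stylesheet instead would avoid touching the HTML, but whether that wins against the inline styles has not been checked here.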

The TOC in RTL languages is appearing in LTR order - would it be possible to adjust this?

Yes. Here is the adjusted PDF:

To generate:

wkhtmltopdf --dump-default-toc-xsl > rtl.xsl
# Add `dir="rtl"` to `body` of rtl.xsl
wkhtmltopdf --print-media-type --footer-right '[page] / [topage]' toc --xsl-style-sheet rtl.xsl page "سانتياغو.html" "سانتياغو.pdf"
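The manual edit described in the comment above (adding `dir="rtl"` to the `body` of rtl.xsl) could be scripted. The sketch below assumes the dumped stylesheet contains a plain `<body ...>` opening tag; the function name `make_toc_rtl` is hypothetical.

```python
import re


def make_toc_rtl(xsl_text):
    """Add dir="rtl" to the first <body> tag unless it already has a dir."""

    def add_dir(match):
        tag = match.group(0)
        if re.search(r"\bdir\s*=", tag):
            return tag  # already has a direction, leave it alone
        return tag[:-1] + ' dir="rtl">'

    return re.sub(r"<body\b[^>]*>", add_dir, xsl_text, count=1)
```

To apply it, read rtl.xsl after the `--dump-default-toc-xsl` step, pass the text through `make_toc_rtl`, and write the result back before running the final command.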

On page 3 of the سانتياغو. article, a scroll bar is appearing - can we remove these and display the entirety of the infobox content?

Yes. These style issues are coming from the templates, so we'll have to override them in our print styles.

Article title is missing - can we add article title?

Yes. Here is one:

Command used: wkhtmltopdf --print-media-type --footer-right '[page] / [topage]' cover BerlinCover.html toc page Berlin.html Berlin.pdf
Cover page HTML:

Are we required to add the article title as a cover page? Can we add it to the same page as the text? The same page as the toc?

For single-page articles, TOC is not necessary - can we remove this?

This functionality is not built-in. We can either output the table of contents or not.

If we do not output it, is there an option to include it within the article (as in the current electron implementation)?

On the trigonometric functions article, do we know why the first formula is missing?

Yes. The problem seems to be with this particular formula. When I insert the same formula into the next row, I see the same problem:

And when I insert the second formula into the first row, both rows work fine:

I also swapped their places to see if position was the problem. It turns out it is not:

So my guess is that the generated SVG is either malformed or contains something the rendering engine doesn't understand.

Is it possible to add these to the renderer in some way?

Are we required to add the article title as a cover page? Can we add it to the same page as the text? The same page as the toc?

Do you mean that the title page should not exist and the title should only appear before the table of contents? The approach suggested by wkhtmltopdf is to add a cover page separate from the content. We're not required to do so. Tell me how you want it and I'll see if that's achievable.

If we do not output it, is there an option to include it within the article (as in the current electron implementation)?

Yes, we'd use the article URL as opposed to the RESTBase URL. At any rate, that's the easiest way. We could also keep using the RESTBase URL and add the table of contents ourselves.

Is it possible to add these to the renderer in some way?

I suppose so. The practicality of this becomes an issue though. We'd have to understand the internals of wkhtmltopdf and upstream a patch (as we don't want to maintain a fork because that would lead to other maintenance issues). Alternatively, we could try and not use the generated SVGs, but use the fallback images. Let me see if I can do that. I'll follow up with a comment.

I've narrowed down the issue with the problematic trigonometric function. The SVG contains the following property:

width="27.999ex"

Any value between 27.990 and 27.999 causes the issue. If I change the value to 27.989 or 28, the problem is gone. I don't know what the best fix is, but it's certainly fixable. One solution is to download problematic SVGs of this kind, modify their widths, and replace the original SVG URLs with the modified file paths. Another solution may be to upstream the issue to the Math extension. We could also try using a different font family and size.
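The first option (patching the width in a downloaded SVG) could be sketched as follows. Only the range reported above is handled, since nothing is known about other values; the function name `fix_svg_width` and the choice of 28 as the replacement are assumptions for illustration.

```python
import re

# Matches a standalone width="<number>ex" attribute (not e.g. stroke-width).
_WIDTH_EX = re.compile(r'(?<![-\w])width="(\d+(?:\.\d+)?)ex"')


def fix_svg_width(svg_text, low=27.990, high=27.999, replacement="28"):
    """Replace ex widths inside the observed problem range with a safe value."""

    def repl(match):
        value = float(match.group(1))
        if low <= value <= high:
            return f'width="{replacement}ex"'
        return match.group(0)  # outside the known-bad range: keep as is

    return _WIDTH_EX.sub(repl, svg_text)
```

Changing the width without touching the height alters the aspect ratio very slightly, which may or may not be visible in the rendered formula.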

Here is the fixed PDF:

The conclusion is that wkhtmltopdf can support most of our use cases. We will compare these results with those of T168004: [Spike 6hrs] Investigate ability of vivliostyle to render single articles