[EPIC] Adding PDF TOC with PDF page numbers to electron
Closed, Invalid (Public)

Description

Reading Web has a hard requirement to support a PDF Table of contents with page numbers for any new PDF generation service that will replace OCG in the collection extension.

Currently 2 options are being evaluated:
https://wkhtmltopdf.org/
https://github.com/msokk/electron-render-service

wkhtmltopdf has support for the ToC feature required by product, while Electron does not. One aspect that has not been quantified is how difficult it would be to implement the PDF ToC feature into Electron. ToC requirements here: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Books

As Electron is already available in an extension, it would be good to quantify this effort before selecting a different solution.

Some questions to answer:

  1. What is the time estimate involved in adding these features to Electron by wrapping the service?
  2. In addition to evaluating the effort of adding these features by wrapping Electron, it is also important to evaluate extending Electron itself. The reason is that we may have an issue with needing to generate 2 PDFs.
  3. Do we really need to render twice to get the PDF?
  4. If so, is generating 2 PDFs that much of an issue? Performance wise, complexity wise?

Event Timeline

There are two distinct features here: TOC page numbers, and document outline (TOC-like metadata that can be handled natively by the PDF viewer). Outlines are less crucial, I imagine.

@Tgr feel free to break this up if you like - I just added the requirement as is from the product perspective.

@Tgr adding some specific questions to the description.

@pmiazga @bmansurov do you have any data here to help with this work?

Pasting what I wrote in T166188:

Alternatively, we could in theory render a PDF using electron, and then add page numbers and the table of contents with page numbers using another tool. If we go that route we'll still have to depend on another toolkit to do the job. I've looked at Pdftk and it seemed abandoned: the latest version appeared about 4 years ago. Another library I checked out was QPDF, whose latest release (version 6.0.0) was at the end of 2015, although there's been some activity on GitHub since then. On the other hand, the latest stable release of wkhtmltopdf (version 0.12.4) came out at the end of 2016. There may be other tools that we can use, and I'm open to exploring them. However, out of the above 3 tools, wkhtmltopdf is both the most recently updated and the easiest to deal with. It's easy because with the other tools, we'll have to use electron first, and then do other transformations to the PDF. I'm not even sure if those tools support the requirements we have.

See the task description for links.

@bmansurov did you investigate implementing the ToC within Electron itself? If it is implemented in wkhtmltopdf, it would stand to reason that this feature would be possible to add to Electron, right?

No, I haven't. I just shared the little info I had. I think it is possible to add the feature to Electron, but I don't know how practical it will be. Since Electron is using the browser's print-to-PDF functionality, we'd either have to add support to that, or create a tool that adds a ToC to the PDF generated by Electron.

Thanks @bmansurov I think that defines the scope of this ticket.

@Tgr do you have any more questions to help with the spike?

Electron (unsurprisingly) offers the same print options as Chrome itself. Asking in their forum might be worth a shot but I doubt Electron could be easily modified to do stuff Chrome itself can't do.

@Fjalapeno

> Do we really need to render twice to get the PDF?

When do we render twice? Do you mean wikitext -> HTML and HTML -> PDF?

@Tgr I believe @bmansurov and @pmiazga said they need to render the content first and then render the toc after.

@Tgr besides extending electron, what about dxt being the electron render service itself? https://github.com/msokk/electron-render-service

The magic behind the TOC in wkhtmltopdf is [[https://github.com/wkhtmltopdf/qt/blob/c0cfa03a072789550d8ff5724b2e5e58436e02d1/src/3rdparty/webkit/Source/WebKit/qt/Api/qwebframe.cpp#L276-L313|QWebPrinter::elementLocation()]], which is something they added to their Qt/Webkit fork. It uses the standard Webkit method [[https://github.com/WebKit/webkit/blob/master/Source/WebCore/page/PrintContext.h#L58|PrintContext::pageRects()]] to get page positions, and compares them with the element's bounding rectangle to get page numbers. (They then extract the page numbers into an XML file and run an XSL transform on it to get a HTML TOC, which they prepend to the document.)
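The comparison described above (page rectangles versus an element's bounding rectangle) can be sketched in a few lines. This is not wkhtmltopdf's code, only a minimal illustration; the function name, the array shape and the sample numbers are assumptions made up for the example.

```php
<?php
/**
 * Return the 1-based number of the page containing a vertical offset,
 * or null if the offset lies outside every page.
 *
 * @param float $elementTop Top edge of the element's bounding rectangle
 * @param array $pageRects  List of [ 'top' => float, 'bottom' => float ], in page order
 */
function pageNumberForOffset( float $elementTop, array $pageRects ): ?int {
    foreach ( $pageRects as $index => $rect ) {
        // A page owns offsets from its top edge up to, but excluding, its bottom edge.
        if ( $elementTop >= $rect['top'] && $elementTop < $rect['bottom'] ) {
            return $index + 1;
        }
    }
    return null;
}

// Three hypothetical pages, each 1000 units tall.
$pageRects = [
    [ 'top' => 0, 'bottom' => 1000 ],
    [ 'top' => 1000, 'bottom' => 2000 ],
    [ 'top' => 2000, 'bottom' => 3000 ],
];
echo pageNumberForOffset( 1250.0, $pageRects ), PHP_EOL;
```

The hard part in Electron's case is not this lookup but getting at the page rectangles in the first place.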

So in theory the logic is not that hard to replicate in another WebKit-based framework, but as far as I understand electron's architecture, it really doesn't fit - unlike wkhtmltopdf which seems to interact with internal WebKit objects directly, electron uses Chromium IPC - there is a proper headless Chromium process running in the background, and the "manager" process sends IPC commands to it, so it's limited to functionality that Chromium exposes that way. (That is a much saner architecture in general, as you get the security and stability guarantees of Chromium, but it's less flexible.)

@Tgr hmmm.... are you saying that you can't easily get the page rects with the current chromium ipc system?

If so, I guess that the solution would be to add that functionality into chromium ipc so we can get that information? And is that extremely difficult?

I also guess that would mean we need to run a custom version of Chrome to make this work if we need to add functionality like this?

Some of the JS tools I referenced in T134205 did / do pagination in browsers (see "Browser print improvement projects" in the description). The earlier tools used CSS regions, but those were removed from Chrome again. The most up to date tool seems to be vivliostyle, which is discussed in some detail in T135022. For example, here is a pure-JS paginated view of [[Barack Obama]]: http://vivliostyle.github.io/vivliostyle.js/viewer/vivliostyle-viewer.html#x=https://en.wikipedia.org/api/rest_v1/page/html/Barack_Obama&f=epubcfi(/2!)

I still think it is very much worth reaching out to the vivliostyle folks, as setting up generally improved pagination & tocs would benefit both client side & server side printing. For them it would be a great showcase of their open source tool.

> @Tgr hmmm.... are you saying that you can't easily get the page rects with the current chromium ipc system?

That's my quick and rather unreliable assessment of the code. Finding a contact in the Chromium developer community would probably yield a more reliable answer.

> If so, I guess that the solution would be to add that functionality into chromium ipc so we can get that information? And is that extremely difficult?
> I also guess that would mean we need to run a custom version of Chrome to make this work if we need to add functionality like this?

Well, the Chromium codebase is about 30x larger than MediaWiki, written in C++ (which also means that, unlike with dynamic languages, it is rather easy to create remote execution vulnerabilities), and none of us has a clue about it. And a local patch would have to be reapplied (and quite possibly rewritten) every time we upgrade. So that's far beyond plausible, IMO. Maybe in the longer term Google or the Electron developer community would be interested to support us by adding this kind of functionality themselves.

There are two other possible approaches:

  • post-processing the PDF - it's already paginated so parse through, find the headings, record their page numbers, use the metadata to add an outline and generate a preface with links and page numbers. In general making sense of a PDF is quite hard; a PDF is basically just a set of line boxes, a separate one for each line, font face, font size etc. with no semantic information, so trying to reconstruct which box is the title is very fragile. But since we are the ones generating the PDF in the first place, maybe we can put in some kind of easily identifiable marker - generate a transparent div with a unique ID, positioned above the heading, and hope the browser is un-clever enough to print it, something like that. Or maybe there is some metadata that's preserved when printing to PDF (anchors, maybe).
  • pre-processing the HTML - format it in a way that's similar to how it will look in print and try to guess where the page borders are. That's just some basic CSS math but unlikely to be reliable since we don't know what exact transformation the browser does on the document before turning it into PDF.
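For the pre-processing option, the "basic CSS math" could look like the sketch below. It assumes the content flows in a single column, that the printable height is the paper height minus the vertical margins, and that the browser adds no page breaks of its own, which is exactly the part that is unlikely to hold. The function name and the sample values are hypothetical.

```php
<?php
// CSS defines 1in = 96px = 25.4mm.
const CSS_PX_PER_MM = 96 / 25.4;

/**
 * Guess the 1-based page on which an element would start when printed.
 *
 * @param float $offsetTopPx    Distance of the element from the top of the document, in CSS px
 * @param float $pageHeightMm   Paper height
 * @param float $marginTopMm    Top page margin
 * @param float $marginBottomMm Bottom page margin
 */
function estimatePageNumber(
    float $offsetTopPx, float $pageHeightMm, float $marginTopMm, float $marginBottomMm
): int {
    $printablePx = ( $pageHeightMm - $marginTopMm - $marginBottomMm ) * CSS_PX_PER_MM;
    return (int)floor( $offsetTopPx / $printablePx ) + 1;
}

// A4 paper (297mm tall) with hypothetical 10mm margins; heading 2500px from the top.
echo estimatePageNumber( 2500.0, 297.0, 10.0, 10.0 ), PHP_EOL;
```

Any break the browser inserts to avoid splitting a line, image or table row shifts everything after it, so the error grows with document length.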

> pre-processing the HTML - format it in a way that's similar to how it will look in print and try to guess where the page borders are. That's just some basic CSS math but unlikely to be reliable since we don't know what exact transformation the browser does on the document before turning it into PDF.

Check out the vivlio link I gave above. At least for Obama, the pagination in the preview does match the print output exactly.

> Check out the vivlio link I gave above. At least for Obama, the pagination in the preview does match the print output exactly.

Interesting! At a glance they don't really paginate, they create a bunch of fixed-size page boxes, put a copy of the full Wikipedia page in each one, and use relative positioning to show a different segment of the page in each box. But it must be more clever than that because they never cut lines.

(On the wider point, I agree collaborating more with developer communities and Not Inventing Here would be cool.)

>> Check out the vivlio link I gave above. At least for Obama, the pagination in the preview does match the print output exactly.

> Interesting! At a glance they don't really paginate, they create a bunch of fixed-size page boxes, put a copy of the full Wikipedia page in each one, and use relative positioning to show a different segment of the page in each box. But it must be more clever than that because they never cut lines.

I inspected the source a bit, and it looked like they did actually cut up the content into the boxes. Only the top-level wrapper element attributes seemed to be repetitive. There were no iframes or the like. With each box sized to fill one page, this would explain why no lines are cut.
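To illustrate why cutting content into page-sized boxes avoids split lines, here is a greedy sketch that treats each line (or other unbreakable block) as a unit and only ever breaks between units. It is not vivliostyle's algorithm; the function name and the heights are assumptions for the example.

```php
<?php
/**
 * Distribute unbreakable blocks over pages, breaking only between blocks.
 *
 * @param float[] $blockHeights Height of each block, in document order
 * @param float   $pageHeight   Usable height of one page box
 * @return int[][] For each page, the indexes of the blocks placed on it
 */
function paginateBlocks( array $blockHeights, float $pageHeight ): array {
    $pages = [ [] ];
    $used = 0.0;
    foreach ( $blockHeights as $index => $height ) {
        $current = count( $pages ) - 1;
        // Start a new page when the block does not fit, unless the page is still
        // empty (an oversized block then simply gets a page of its own).
        if ( $used + $height > $pageHeight && $pages[$current] !== [] ) {
            $pages[] = [];
            $current++;
            $used = 0.0;
        }
        $pages[$current][] = $index;
        $used += $height;
    }
    return $pages;
}

// Four hypothetical blocks on pages that are 100 units tall.
print_r( paginateBlocks( [ 40, 40, 40, 30 ], 100.0 ) );
```

Once blocks are assigned to pages this way, the page number of any heading is just the index of the page holding its block, plus one.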

However, the way long tables are cut off does not look quite optimal yet. They seem to be missing code that re-opens the table with the remaining content in the next page box, potentially repeating table headings. I don't see anything in their approach that would prevent this from being added though, and the result would be an improvement over the arbitrary (often mid-line) cut most browsers implement for printing.

Marking up the PDF is possible but highly impractical.

I prepended a fake link to the h2 tag: <h2><a href="#electron-toc-Notes">&nbsp;</a><span id="electron-toc-Notes"></span>... (Chrome seems clever enough to filter out links which do not have a corresponding anchor, or which are completely invisible, e.g. only contain a zero-width character.) I then generated the PDF from that markup (using https://gerrit.wikimedia.org/r/#/c/356991/ ), and could extract the page number with smalot/pdfparser:

// Requires smalot/pdfparser, installed via Composer.
require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile( 'Berlin-electron.pdf' );
$pages  = $pdf->getPages();

// Walk each page's link annotations and report those pointing at a TOC marker.
foreach ( $pages as $pageNum => $page ) {
    $details = $page->getHeader()->getDetails();
    $annotations = $details['Annots'] ?? [];
    foreach ( $annotations as $annotation ) {
        if ( ( $annotation['Subtype'] ?? null ) === 'Link' && isset( $annotation['Dest'] ) ) {
            $dest = $annotation['Dest'];
            if ( preg_match( '/^electron-toc-/', $dest ) ) {
                // Strip the marker prefix; page keys are 0-based, page numbers 1-based.
                echo preg_replace( '/^electron-toc-/', '', $dest ) . ': ' . ( $pageNum + 1 ) . PHP_EOL;
            }
        }
    }
}
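The marker injection that precedes this extraction could be automated along these lines. This is a rough sketch, not the code used for the experiment: it assumes every heading carries its id directly on the h2 tag, which is not necessarily how the real page HTML looks, and the function name is made up.

```php
<?php
/**
 * Prepend a visible self-link and a matching anchor to every <h2> that has an id,
 * so that the printed PDF keeps a link annotation pointing at the heading.
 */
function addTocMarkers( string $html ): string {
    $result = preg_replace_callback(
        // Match an opening h2 tag with an id attribute and capture the id value.
        '/<h2\b[^>]*\sid="([^"]+)"[^>]*>/',
        function ( array $m ): string {
            $target = 'electron-toc-' . $m[1];
            // &nbsp; keeps the link non-empty; the span is the anchor it points to.
            return $m[0]
                . '<a href="#' . $target . '">&nbsp;</a>'
                . '<span id="' . $target . '"></span>';
        },
        $html
    );
    // preg_replace_callback() returns null on error; fall back to the input then.
    return $result ?? $html;
}

echo addTocMarkers( '<h2 id="Notes">Notes</h2>' ), PHP_EOL;
```

A DOM parser would be more robust than a regular expression for real page HTML; the regex is used here only to keep the sketch short.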

Actually there is a much simpler way. PDF documents contain the targets of internal links (such as the TOC link) as Destination objects (PDF spec 12.3.2.2) with the id value as the name, and the Destination object includes a reference to the parent page. So given an id (which is already provided by MCS for example) one can easily extract the page number:

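// Requires smalot/pdfparser, installed via Composer.
require 'vendor/autoload.php';

$id = 'Notes'; // example value: the id of the heading to look up

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile( 'Berlin-electron.pdf' );
$pages  = $pdf->getPages();

// end() takes its argument by reference, so the array needs its own variable.
$objects = $pdf->getObjects();
$lastObject = end( $objects );
// The destination's first element is a reference to its parent page.
$page = $lastObject->getHeader()->get( $id )->getContent()[0];
$pageNumber = array_search( $page, $pages ) + 1;

echo "page: $pageNumber\n";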

So this would allow the extraction of page numbers of headings from the PDF, given the TOC ids of the original page. It's still impractical because it would force a two-step PDF generation plus it would probably require some awkward changes to the API of the electron service (change it to multipart/form-data so the TOC can be passed, or something like that). Nevertheless, it's doable in a bind.
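Once the page numbers are extracted, the second step of such a two-step generation has to turn them into TOC entries. The sketch below shows that step in plain text with dot leaders. The function name and parameters are hypothetical, and the `$pageOffset` parameter rests on an assumption not tested in this thread: that a TOC prepended to the document pushes every heading back by the number of pages the TOC itself takes up.

```php
<?php
/**
 * Build plain-text TOC lines such as "Notes ........... 13".
 *
 * @param string[] $titles     Heading id => heading title, in document order
 * @param int[]    $pages      Heading id => page number extracted from the first PDF
 * @param int      $pageOffset Pages added in front of the content (assumption: the TOC itself)
 * @param int      $width      Total line width in bytes
 * @return string[]
 */
function renderTocLines( array $titles, array $pages, int $pageOffset = 0, int $width = 40 ): array {
    $lines = [];
    foreach ( $titles as $id => $title ) {
        if ( !isset( $pages[$id] ) ) {
            continue; // heading not found in the PDF; skip it rather than guess
        }
        $number = (string)( $pages[$id] + $pageOffset );
        // Two spaces separate the dots from the title and the number.
        $dots = max( 1, $width - strlen( $title ) - strlen( $number ) - 2 );
        $lines[] = $title . ' ' . str_repeat( '.', $dots ) . ' ' . $number;
    }
    return $lines;
}

echo implode( PHP_EOL, renderTocLines(
    [ 'History' => 'History', 'Notes' => 'Notes' ],
    [ 'History' => 3, 'Notes' => 12 ],
    1
) ), PHP_EOL;
```

Note that `strlen()` counts bytes, so titles with non-ASCII characters would need `mb_strlen()` to align properly.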

One aspect I find attractive about doing the pagination client side is that it would also benefit the (probably larger number of) people printing things from their browser. There is no need to re-download the content, and printing would even work from a web app in offline mode.

> One aspect I find attractive about doing the pagination client side is that it would also benefit the (probably larger number of) people printing things from their browser. There is no need to re-download the content, and printing would even work from a web app in offline mode.

Agreed in theory; Vivliostyle Viewer has a very heavy UI though and it would probably be very confusing to users. Plus it's a single page app so we'd need to direct users there before printing.
(In theory, vivliostyle.js can be used on its own, with no UI, but 1) that does not work out of the box, 2) not sure when or how it would be set up as there is no equivalent of @media print in JavaScript, and beforeprint is not async.)

Jdlrobson renamed this task from [Spike] Investigate adding PDF TOC with PDF page numbers to electron to [EPIC] Adding PDF TOC with PDF page numbers to electron. Feb 5 2018, 6:40 PM

Closing as per T184772#4116906. Pediapress will be taking on books functionality from this point forward.