[EPIC] Adding PDF TOC with PDF page numbers to electron
Closed, Invalid (Public)

Assigned To

None

Authored By

	• Fjalapeno
	Jun 6 2017, 9:00 PM

Description

Reading Web has a hard requirement to support a PDF Table of contents with page numbers for any new PDF generation service that will replace OCG in the collection extension.

Currently 2 options are being evaluated:
https://wkhtmltopdf.org/
https://github.com/msokk/electron-render-service

wkhtmltopdf has support for the ToC feature required by product, while Electron does not. One aspect that has not been quantified is how difficult it would be to implement the PDF ToC feature into Electron. ToC requirements here: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Books

As Electron is already available in an extension, it would be good to quantify this effort before selecting a different solution.

Some questions to answer:

What is the time estimate involved in adding these features to Electron by wrapping the service?

In addition to evaluating the effort of adding these features by wrapping Electron, it is also important to evaluate extending Electron itself. The reason is that we may have an issue with needing to generate 2 PDFs.

Do we really need to render twice to get the PDF?

If so, is generating 2 PDFs that much of an issue? Performance wise, complexity wise?

Related Objects

Status	Subtype	Assigned	Task
Resolved		• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Resolved		TheDJ	T150872 Replace OCG in collection extension with Electron
Invalid		None	T167210 [EPIC] Adding PDF TOC with PDF page numbers to electron
Resolved	Spike	phuedx	T168004 [Spike 6hrs] Investigate ability of vivliostyle to render single articles
Invalid	Spike	None	T169738 [Spike 8hrs] Investigate ability of using post-processing approach with new print styles
Resolved		• Nirzar	T169823 Design changes to desktop print styles
Resolved		• Nirzar	T169826 Add project wordmark to print styles
Declined		Jdlrobson	T171114 Show important indicators in print styles
Declined		None	T171250 Display categories while printing articles again
Resolved		• Nirzar	T171330 Justify text in print styles?
Resolved		• Nirzar	T172144 References take up more space with new print styles than existing print style
Resolved		• Nirzar	T173767 reduce space taken by TOC in new print styles
Resolved		ovasileva	T179363 Remove tocnumbers from TOC layout in print mode as they display incorrectly with numbers > 10 and their usefulness is debatable
Resolved		ABorbaWMF	T172184 QA of new desktop print styles
Resolved		ovasileva	T172414 Add print styles test articles to the beta cluster
Declined		None	T174955 Infobox styling disappears in firefox
Resolved		ABorbaWMF	T174957 Infobox breaking toc in new print styles
Resolved		Jdlrobson	T176463 [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies
Resolved		ovasileva	T168871 Introduct toc with page numbers during pdf post-processing

Event Timeline

• Fjalapeno created this task.Jun 6 2017, 9:00 PM

Jdlrobson moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.Jun 6 2017, 9:44 PM

There are two distinct features here: TOC page numbers, and document outline (TOC-like metadata that can be handled natively by the PDF viewer). Outlines are less crucial, I imagine.

@Tgr feel free to break this up if you like - I just added the requirement as is from the product perspective.

@Tgr adding some specific questions to the description.

@pmiazga @bmansurov do you have any data here to help with this work?

ovasileva mentioned this in T166188: Architecture of new rendering backend for Extension:Collection.Jun 7 2017, 4:03 PM

ovasileva updated the task description.

• Fjalapeno added a parent task: T150872: Replace OCG in collection extension with Electron.Jun 7 2017, 4:16 PM

Pasting what I wrote in T166188:

Alternatively, we could in theory render a PDF using Electron, and then add page numbers and the table of contents with page numbers using another tool. If we go that route we'll still have to depend on another toolkit to do the job. I've looked at Pdftk and it seemed abandoned: the latest version appeared about 4 years ago. Another library I checked out was QPDF, whose latest release (version 6.0.0) was at the end of 2015, although there's been some activity on GitHub since then. On the other hand, the latest stable release of wkhtmltopdf (version 0.12.4) came out at the end of 2016. There may be other tools that we can use, and I'm open to exploring them. However, out of the above 3 tools, wkhtmltopdf is both the most recently updated and the easiest to deal with. It's easy because with the other tools, we'll have to use Electron first, and then do other transformations to the PDF. I'm not even sure if those tools support the requirements we have.

See the task description for links.

@bmansurov did you investigate implementing the ToC within Electron itself? If it is implemented in wkhtmltopdf, it would stand to reason that this feature would be possible to add to Electron, right?

No, I haven't. I just shared the little info I had. I think it is possible to add the feature to Electron, but I don't know how practical it will be. Since Electron is using browser's print-to-PDF functionality, we'd either have to add support to that, or create a tool that adds ToC to the PDF generated by Electron.

Thanks @bmansurov I think that defines the scope of this ticket.

@Tgr do you have any more questions to help with the spike?

Electron (unsurprisingly) offers the same print options as Chrome itself. Asking in their forum might be worth a shot but I doubt Electron could be easily modified to do stuff Chrome itself can't do.

@Fjalapeno

Do we really need to render twice to get the PDF?

When do we render twice? Do you mean wikitext -> HTML and HTML -> PDF?

@Tgr I believe @bmansurov and @pmiazga said they need to render the content first and then render the toc after.

@Tgr besides extending electron, what about dxt being the electron render service itself? https://github.com/msokk/electron-render-service

The magic behind the TOC in wkhtmltopdf is [[https://github.com/wkhtmltopdf/qt/blob/c0cfa03a072789550d8ff5724b2e5e58436e02d1/src/3rdparty/webkit/Source/WebKit/qt/Api/qwebframe.cpp#L276-L313|QWebPrinter::elementLocation()]], which is something they added to their Qt/Webkit fork. It uses the standard Webkit method [[https://github.com/WebKit/webkit/blob/master/Source/WebCore/page/PrintContext.h#L58|PrintContext::pageRects()]] to get page positions, and compares them with the element's bounding rectangle to get page numbers. (They then extract the page numbers into an XML file and run an XSL transform on it to get a HTML TOC, which they prepend to the document.)
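A minimal sketch of the page-lookup idea described above, in PHP rather than wkhtmltopdf's C++. This is not wkhtmltopdf's actual code: the function name and the array shapes are hypothetical, and it assumes the page that contains the element's top edge is the one that counts.

```php
<?php
// Hypothetical helper: maps an element's bounding rectangle to a 1-based
// page number, given the page rectangles in document order.

/**
 * @param array $pageRects   list of [ 'top' => float, 'height' => float ]
 * @param array $elementRect [ 'top' => float, 'height' => float ]
 */
function pageNumberForElement( array $pageRects, array $elementRect ): ?int {
    foreach ( $pageRects as $index => $page ) {
        $pageBottom = $page['top'] + $page['height'];
        // Assumption: the element belongs to the page containing its top edge.
        if ( $elementRect['top'] >= $page['top'] && $elementRect['top'] < $pageBottom ) {
            return $index + 1;
        }
    }
    // The element lies outside every page rectangle.
    return null;
}

$pages = [
    [ 'top' => 0, 'height' => 1000 ],
    [ 'top' => 1000, 'height' => 1000 ],
    [ 'top' => 2000, 'height' => 1000 ],
];
echo pageNumberForElement( $pages, [ 'top' => 1450, 'height' => 30 ] ), PHP_EOL; // prints 2
```

The hard part is not this comparison but getting reliable page rectangles out of the rendering engine in the first place.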

So in theory the logic is not that hard to replicate in another WebKit-based framework, but as far as I understand electron's architecture, it really doesn't fit - unlike wkhtmltopdf which seems to interact with internal WebKit objects directly, electron uses Chromium IPC - there is a proper headless Chromium process running in the background, and the "manager" process sends IPC commands to it, so it's limited to functionality that Chromium exposes that way. (That is a much saner architecture in general, as you get the security and stability guarantees of Chromium, but it's less flexible.)

@Tgr hmmm.... are you saying that you can't easily get the page rects with the current chromium ipc system?

If so, I guess that the solution would be to add that functionality into chromium ipc so we can get that information? And is that extremely difficult?

I also guess that would mean we need to run a custom version of Chrome to make this work if we need to add functionality like this?

Some of the JS tools I referenced in T134205 did / do pagination in browsers (see "Browser print improvement projects" in the description). The earlier tools used CSS regions, but those were removed from Chrome again. The most up to date tool seems to be vivliostyle, which is discussed in some detail in T135022. For example, here is a pure-JS paginated view of [[Barack Obama]]: http://vivliostyle.github.io/vivliostyle.js/viewer/vivliostyle-viewer.html#x=https://en.wikipedia.org/api/rest_v1/page/html/Barack_Obama&f=epubcfi(/2!)

I still think it is very much worth reaching out to the vivliostyle folks, as setting up generally improved pagination & tocs would benefit both client side & server side printing. For them it would be a great showcase of their open source tool.

In T167210#3335518, @Fjalapeno wrote:

@Tgr hmmm.... are you saying that you can't easily get the page rects with the current chromium ipc system?

That's my quick and rather unreliable assessment of the code. Finding a contact in the Chromium developer community would probably yield a more reliable answer.

If so, I guess that the solution would be to add that functionality into chromium ipc so we can get that information? And is that extremely difficult?
I also guess that would mean we need to run a custom version of Chrome to make this work if we need to add functionality like this?

Well, the Chromium codebase is about 30x larger than MediaWiki, written in C++ (which also means that, unlike with dynamic languages, it is rather easy to create remote execution vulnerabilities), and none of us has a clue about it. And a local patch would have to be reapplied (and quite possibly rewritten) every time we upgrade. So that's far beyond plausible, IMO. Maybe in the longer term Google or the Electron developer community would be interested to support us by adding this kind of functionality themselves.

There are two other possible approaches:

post-processing the PDF - it's already paginated, so parse through it, find the headings, record their page numbers, and use that metadata to add an outline and generate a preface with links and page numbers. In general making sense of a PDF is quite hard; a PDF is basically just a set of line boxes, a separate one for each line, font face, font size etc. with no semantic information, so trying to reconstruct which box is the title is very fragile. But since we are the ones generating the PDF in the first place, maybe we can put in some kind of easily identifiable marker - generate a transparent div with a unique ID, positioned above the heading, and hope the browser is un-clever enough to print it, something like that. Or maybe there is some metadata that's preserved when printing to PDF (anchors, maybe).
pre-processing the HTML - format it in a way that's similar to how it will look in print and try to guess where the page borders are. That's just some basic CSS math but unlikely to be reliable since we don't know what exact transformation the browser does on the document before turning it into PDF.
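To make the "basic CSS math" of the pre-processing option concrete, here is a rough sketch. It assumes uniform A4 pages, a hypothetical 10 mm margin on the top and bottom, and that the print layout matches the on-screen layout, which is exactly the part that is unlikely to hold. The function name is made up for illustration.

```php
<?php
// Rough estimate only: guesses the page a heading lands on from its vertical
// offset in the (print-like) layout.

const CSS_PX_PER_MM = 96 / 25.4; // CSS defines 1in = 96px = 25.4mm

function estimatePageNumber(
    float $offsetTopPx,
    float $pageHeightMm = 297.0, // A4 height
    float $marginMm = 10.0       // assumed top and bottom margin
): int {
    // Height of the printable area of one page, in CSS pixels.
    $contentHeightPx = ( $pageHeightMm - 2 * $marginMm ) * CSS_PX_PER_MM;
    return (int)floor( $offsetTopPx / $contentHeightPx ) + 1;
}

echo estimatePageNumber( 2500.0 ), PHP_EOL; // prints 3
```

Anything that moves content when printing (page-break rules, headers and footers, images that cannot be split) would make this estimate drift.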

pre-processing the HTML - format it in a way that's similar to how it will look in print and try to guess where the page borders are. That's just some basic CSS math but unlikely to be reliable since we don't know what exact transformation the browser does on the document before turning it into PDF.

Check out the vivlio link I gave above. At least for Obama, the pagination in the preview does match the print output exactly.

In T167210#3335764, @GWicke wrote:

Check out the vivlio link I gave above. At least for Obama, the pagination in the preview does match the print output exactly.

Interesting! At a glance they don't really paginate, they create a bunch of fixed-size page boxes, put a copy of the full Wikipedia page in each one, and use relative positioning to show a different segment of the page in each box. But it must be more clever than that because they never cut lines.

(On the wider point, I agree collaborating more with developer communities and Not Inventing Here would be cool.)

In T167210#3335926, @Tgr wrote:

In T167210#3335764, @GWicke wrote:

Check out the vivlio link I gave above. At least for Obama, the pagination in the preview does match the print output exactly.

Interesting! At a glance they don't really paginate, they create a bunch of fixed-size page boxes, put a copy of the full Wikipedia page in each one, and use relative positioning to show a different segment of the page in each box. But it must be more clever than that because they never cut lines.

I inspected the source a bit, and it looked like they did actually cut up the content into the boxes. Only the top-level wrapper element attributes seemed to be repetitive. There were no iframes or the like. With each box sized to fill one page, this would explain why no lines are cut.

However, the way long tables are cut off does not look quite optimal yet. They seem to be missing code that re-opens the table with the remaining content in the next page box, potentially repeating table headings. I don't see anything in their approach that would prevent this from being added though, and the result would be an improvement over the arbitrary (often mid-line) cut most browsers implement for printing.
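The table re-opening idea could look roughly like the sketch below. This is not vivliostyle code; the function name is hypothetical, and it assumes a fixed number of body rows fits on each page, whereas a real paginator would have to measure row heights.

```php
<?php
// Hypothetical sketch: split table body rows into per-page chunks and repeat
// the header row at the top of each chunk.

function splitTableAcrossPages( array $headerRow, array $bodyRows, int $rowsPerPage ): array {
    if ( $rowsPerPage < 1 ) {
        throw new InvalidArgumentException( 'rowsPerPage must be at least 1' );
    }
    $pages = [];
    foreach ( array_chunk( $bodyRows, $rowsPerPage ) as $chunk ) {
        // Each page box gets its own table: header first, then its share of rows.
        $pages[] = array_merge( [ $headerRow ], $chunk );
    }
    return $pages;
}

$tables = splitTableAcrossPages(
    [ 'Year', 'Event' ],
    [ [ '2015', 'A' ], [ '2016', 'B' ], [ '2017', 'C' ] ],
    2
);
echo count( $tables ), PHP_EOL; // prints 2
```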

Marking up the PDF is possible but highly impractical.

I prepended a fake link to the h2 tag: <h2><a href="#electron-toc-Notes"> </a><span id="electron-toc-Notes"></span>... (Chrome seems clever enough to filter out links which do not have a corresponding anchor, or which are completely invisible, e.g. only contain a zero-width character.), generated the PDF with <h2><a href="#electron-toc-Notes"> </a><span id="electron-toc-Notes"></span> (using https://gerrit.wikimedia.org/r/#/c/356991/ ), and could extract the page number with smalot/pdfparser:

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile( 'Berlin-electron.pdf' );
$pages  = $pdf->getPages();

foreach ( $pages as $pageNum => $page ) {
    $details = $page->getHeader()->getDetails();
    $annotations = isset( $details['Annots'] ) ? $details['Annots'] : [];
    foreach ( $annotations as $annotation ) {
        if ( $annotation['Subtype'] === 'Link' && isset( $annotation['Dest'] ) ) {
            $dest = $annotation['Dest'];
            if ( preg_match( '/^electron-toc-/', $dest ) ) {
                echo preg_replace( '/^electron-toc-/', '', $dest ) . ': ' . ( $pageNum + 1 ) . PHP_EOL;
            }
        }
    }
}
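For completeness, the marker injection that has to happen before the PDF is generated could be sketched like this. The "electron-toc-" prefix and the markup follow the description above; the function name is hypothetical, and the sketch assumes the id sits directly on the h2 element, which may not match the real article HTML. A regex is used only to keep the example short; a DOM parser would be more robust.

```php
<?php
// Sketch only: adds a link/anchor marker pair right after the opening tag of
// every <h2> that carries an id attribute.

function addTocMarkers( string $html ): string {
    $result = preg_replace_callback(
        '/<h2([^>]*)\bid="([^"]+)"([^>]*)>/i',
        function ( array $m ) {
            $name = 'electron-toc-' . $m[2];
            // The link text is a single space, as in the markup described above.
            return $m[0] . '<a href="#' . $name . '"> </a><span id="' . $name . '"></span>';
        },
        $html
    );
    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $html;
}

echo addTocMarkers( '<h2 id="Notes">Notes</h2>' ), PHP_EOL;
```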

Actually there is a much simpler way. PDF documents contain the targets of internal links (such as the TOC link) as Destination objects (PDF spec 12.3.2.2) with the id value as the name, and the Destination object includes a reference to the parent page. So given an id (which is already provided by MCS for example) one can easily extract the page number:

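require 'vendor/autoload.php'; // assumes smalot/pdfparser was installed with Composer

$id = 'Notes'; // example heading id; use an id from the page's TOC

$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile( 'Berlin-electron.pdf' );
$pages  = $pdf->getPages();

// end() takes its argument by reference, so it needs a variable.
$objects = $pdf->getObjects();
$lastObject = end( $objects );
// Look up the named destination; its first element is the page it points to.
$page = $lastObject->getHeader()->get( $id )->getContent()[0];
$pageNumber = array_search( $page, $pages ) + 1;

echo "page: $pageNumber\n";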

So this would allow the extraction of page numbers of headings from the PDF, given the TOC ids of the original page. It's still impractical because it would force a two-step PDF generation plus it would probably require some awkward changes to the API of the electron service (change it to multipart/form-data so the TOC can be passed, or something like that). Nevertheless, it's doable in a bind.
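The second step of such a two-step generation might be sketched as follows: once the page numbers are known, build the TOC HTML that gets prepended for the second render. All names here are hypothetical, and the sketch assumes the number of pages the TOC itself occupies is known up front, since prepending it shifts every page number.

```php
<?php
// Sketch with hypothetical names: builds an HTML TOC from heading id => title
// and heading id => page number maps.

function buildTocHtml( array $titles, array $pageNumbers, int $tocPageCount = 1 ): string {
    $items = '';
    foreach ( $titles as $id => $title ) {
        if ( !isset( $pageNumbers[$id] ) ) {
            continue; // heading not found in the PDF; skip it rather than guess
        }
        $items .= sprintf(
            '<li><a href="#%s">%s</a> <span class="toc-page">%d</span></li>',
            htmlspecialchars( $id ),
            htmlspecialchars( $title ),
            // Shift by the number of pages the prepended TOC occupies.
            $pageNumbers[$id] + $tocPageCount
        );
    }
    return '<ol class="toc">' . $items . '</ol>';
}

echo buildTocHtml( [ 'Notes' => 'Notes' ], [ 'Notes' => 12 ] ), PHP_EOL;
```

If the TOC grows past the assumed page count, the numbers would be off, so a real implementation would have to check the TOC length after the second render.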

One aspect I find attractive about doing the pagination client side is that it would also benefit the (probably larger number of) people printing things from their browser. There is no need to re-download the content, and printing would even work from a web app in offline mode.

Tgr moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.Jun 14 2017, 3:50 PM

Jdlrobson moved this task from Needs Prioritization to Epics/Goals on the Web-Team-Backlog board.Jun 21 2017, 11:59 PM

Tgr added subtasks: T168004: [Spike 6hrs] Investigate ability of vivliostyle to render single articles, T168871: Introduct toc with page numbers during pdf post-processing.Jul 4 2017, 6:45 PM

In T167210#3348595, @GWicke wrote:

One aspect I find attractive about doing the pagination client side is that it would also benefit the (probably larger number of) people printing things from their browser. There is no need to re-download the content, and printing would even work from a web app in offline mode.

Agreed in theory; Vivliostyle Viewer has a very heavy UI though and it would probably be very confusing to users. Plus it's a single page app so we'd need to direct users there before printing.
(In theory, vivliostyle.js can be used on its own, with no UI, but 1) that does not work out of the box, 2) not sure when or how it would be set up, as there is no equivalent of @media print in JavaScript, and beforeprint is not async.)

Tgr added a subtask: T169897: Track print-related web standards.Jul 10 2017, 11:03 AM

Work on this happened in subtasks:

• Fjalapeno edited projects, added Product-Infrastructure-Team-Backlog-Deprecated; removed Product-Infrastructure-Team-Backlog-Deprecated (Kanban).Jul 10 2017, 5:55 PM

• Fjalapeno moved this task from Needs triage to Tracking on the Product-Infrastructure-Team-Backlog-Deprecated board.

• Fjalapeno unsubscribed.

• Fjalapeno moved this task from Tracking to Epics on the Product-Infrastructure-Team-Backlog-Deprecated board.Jul 13 2017, 2:24 PM

ovasileva closed subtask T168871: Introduct toc with page numbers during pdf post-processing as Resolved.Jul 25 2017, 12:13 PM

ovasileva changed the status of subtask T168004: [Spike 6hrs] Investigate ability of vivliostyle to render single articles from Open to Stalled.Jul 26 2017, 5:25 PM

phuedx closed subtask T168004: [Spike 6hrs] Investigate ability of vivliostyle to render single articles as Resolved.Aug 23 2017, 9:10 AM

• bmansurov unsubscribed.Dec 22 2017, 9:46 PM

Liuxinyu970226 subscribed.Dec 30 2017, 2:15 AM

Jdlrobson renamed this task from [Spike] Investigate adding PDF TOC with PDF page numbers to electron to [EPIC] Adding PDF TOC with PDF page numbers to electron.Feb 5 2018, 6:40 PM

ovasileva moved this task from Epics/Goals to Product Owner Backlog on the Web-Team-Backlog board.Feb 13 2018, 6:51 PM

Jdlrobson moved this task from Product Owner Backlog to Tracking on the Web-Team-Backlog board.Feb 15 2018, 12:42 AM

Jdlrobson edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.

Jdlrobson moved this task from Untriaged to Discuss further on the Web-Team-Backlog (Tracking) board.Feb 15 2018, 12:47 AM

Closing as per T184772#4116906. Pediapress will be taking on books functionality from this point forward.

Restricted Application removed a subscriber: Liuxinyu970226.Apr 9 2018, 2:38 PM

TheDJ merged a task: T183104: Transform TOC into PDF outline on export.Jan 17 2020, 12:31 PM

TheDJ added subscribers: Volker_E, MJL, Kaartic and 2 others.

Izno removed a parent task: T166188: Architecture of new rendering backend for Extension:Collection.Dec 8 2020, 3:40 AM

Izno removed a subtask: T169897: Track print-related web standards.
