[Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies
Closed, Resolved (Public)

Authored By

	ovasileva
	Sep 22 2017, 9:52 AM

Description

Background

Ideally, we'd like headless Chromium and post-processing to share a single service, rather than giving the post-processing steps their own Python service.

In T175853: [Spike 16hr] Investigate the ability of Python wrapped headless Chromium to render large books, we found that it's best to interact with headless Chromium using the official JS library. Thus we'd like to investigate Node.js libraries that allow us to manipulate PDFs. We also want to restrict the search to libraries that are written in JS only. There was pushback from Ops and Services when we wanted to use wkhtmltopdf to render PDFs, partly because it was written in C++. Some of the reasons given were: (a) it is hard to maintain, (b) it is a security risk, (c) if something goes wrong, we don't have C++ developers handy to fix the issue with the underlying library.

A/C

Find a Node.js library that has the capability to do the following (in addition to being written in JS only and not offloading the work to external programs):

Add page numbers to PDF pages;
Add pages to a PDF;
Remove pages from a PDF;
Given the table of contents with links that point to headings in the PDF, find page numbers of headings in the PDF;
Add an outline;
Add metadata such as the author, title, etc.
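
To make the fourth requirement concrete, here is a minimal sketch of the lookup step only. It assumes some PDF library could hand us a map from heading anchors to zero-based page indexes; the function name `resolveTocPageNumbers` and both input shapes are hypothetical and do not come from any real library.

```javascript
// Hypothetical inputs: TOC links as plain objects and a Map from anchor name
// to zero-based page index. Neither shape is taken from a real PDF library.
function resolveTocPageNumbers(tocEntries, destinations) {
  return tocEntries.map((entry) => {
    // Strip the leading '#' so '#History' matches the destination 'History'.
    const name = entry.href.replace(/^#/, '');
    const pageIndex = destinations.get(name);
    return {
      title: entry.title,
      // Page indexes are zero-based, so add 1 for the printed page number.
      // Unresolved links get null rather than a guessed page.
      page: pageIndex === undefined ? null : pageIndex + 1,
    };
  });
}

const toc = [
  { title: 'History', href: '#History' },
  { title: 'Geography', href: '#Geography' },
  { title: 'Missing section', href: '#Nope' },
];
const destinations = new Map([
  ['History', 2],
  ['Geography', 6],
]);

const resolved = resolveTocPageNumbers(toc, destinations);
console.log(resolved);
```

The hard part of the requirement is producing the `destinations` map from an existing PDF, which this sketch does not attempt.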

Related Objects

Status	Subtype	Assigned	Task
Resolved		• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Resolved		TheDJ	T150872 Replace OCG in collection extension with Electron
Duplicate		None	T150874 Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf
Resolved		• JKatzWMF	T150875 Confirm attribution needs
Declined		None	T150917 Remove deprecated features from book creator UI
Resolved		pmiazga	T163272 [Spike] Determine changes necessary for concatenation support
Invalid		None	T167210 [EPIC] Adding PDF TOC with PDF page numbers to electron
Resolved	Spike	phuedx	T168004 [Spike 6hrs] Investigate ability of vivliostyle to render single articles
Invalid		None	T169757 Improve usability of "download as pdf" page
Invalid		None	T186740 [EPIC] It should be possible to print a book using the Proton service
Stalled		None	T174670 Remove banner from saved books
Invalid		None	T171832 Deploy new book renderer to all projects
Duplicate		None	T171833 Deploy new book renderer to all projects side by side with OCG
Declined		None	T173018 Add an option in Special:Book to download PDFs generated by ElectronPdfService
Invalid		None	T173015 Use PDF post-processing service to generate final PDF
Invalid		None	T173579 Expose PDF post-processing scripts as a stateless web service
Resolved		• bmansurov	T171965 [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection
Invalid		ovasileva	T171960 Create a library to post-process PDF and add page numbers and table of contents
Resolved		pmiazga	T171838 Build out article concatenation according to requirements for books
Resolved		phuedx	T171964 [Spike - 8 hrs] Where should article concatenation be implemented?
Resolved		ovasileva	T175856 Implement changes to article concatenation based on books requirements
Invalid		ovasileva	T177805 [Spike] How do we render contributors and images section of books accurately?
Resolved		phuedx	T177672 Collection tests do not run properly
Resolved		phuedx	T177801 Collection phpunit tests are failing for table of contents when run locally
Resolved		Jdlrobson	T177892 PDF table of contents styling font-size is inconsistent
Invalid		None	T177993 Article concatenation fails on large books
Resolved		ovasileva	T177994 Book generation fails for articles with '/' character in title
Invalid		None	T177996 Article concatenation not resilient to curl errors
Invalid		None	T182230 [Spike] Explore ways of creating a stateless web service in Python
Resolved		• dpatrick	T173014 Security review of pdfrw
Invalid	Spike	None	T169738 [Spike 8hrs] Investigate ability of using post-processing approach with new print styles
Resolved		Jdlrobson	T176463 [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies

Event Timeline

ovasileva created this task. Sep 22 2017, 9:52 AM

Restricted Application added a subscriber: Aklapper. Sep 22 2017, 9:52 AM

@bmansurov, @phuedx to update task description

ovasileva mentioned this in T175853: [Spike 16hr] Investigate the ability of Python wrapped headless Chromium to render large books. Sep 22 2017, 9:53 AM

ovasileva added a project: Web-Team-Backlog. Sep 22 2017, 12:47 PM

ovasileva moved this task from Incoming to Upcoming on the Web-Team-Backlog board.

ovasileva mentioned this in T171960: Create a library to post-process PDF and add page numbers and table of contents.

ovasileva added a parent task: T171960: Create a library to post-process PDF and add page numbers and table of contents.

ovasileva mentioned this in T173579: Expose PDF post-processing scripts as a stateless web service. Sep 22 2017, 12:49 PM

ovasileva mentioned this in T173015: Use PDF post-processing service to generate final PDF.

• bmansurov renamed this task from [Spike] Investigate libraries for post-processing with fewer dependencies to [Spike] Investigate libraries for post-processing without non-JS dependencies. Sep 22 2017, 2:14 PM

• bmansurov updated the task description.

ovasileva added a parent task: T169738: [Spike 8hrs] Investigate ability of using post-processing approach with new print styles. Sep 22 2017, 2:59 PM

ovasileva renamed this task from [Spike] Investigate libraries for post-processing without non-JS dependencies to [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies. Sep 26 2017, 4:38 PM

Jdlrobson awarded a token. Sep 26 2017, 4:41 PM

ovasileva added a project: Readers-Web-Kanbanana-Board-Old. Sep 26 2017, 4:56 PM

Jdlrobson moved this task from Upcoming to 2017-18 Q1 on the Web-Team-Backlog board. Sep 26 2017, 7:47 PM

• Niedzielski added a project: Spike. Sep 27 2017, 5:09 PM

• bmansurov claimed this task. Sep 29 2017, 2:14 PM

• bmansurov moved this task from To Do to Doing on the Readers-Web-Kanbanana-Board-Old board.

• bmansurov updated the task description.

• bmansurov updated the task description. Sep 29 2017, 4:21 PM

I mainly considered libraries that have reached version 1.0 or later (not that there were any great libraries below that version number). I didn't consider commercially licensed libraries such as jsreport.

There are many JS projects that deal with PDFs on github.com or npmjs.com, and here are the most relevant ones:

jsPDF has support for creating PDFs, but doesn't seem to support manipulating existing PDFs.
pdf.js doesn't support editing PDFs; it is only for reading them.
pdfkit is focused on creating documents from scratch, rather than querying existing ones or manipulating their pages.
pdfmake is based on pdfkit, and has more or less the same features (I couldn't gather the differences, but it certainly doesn't have all the features that we need).

Verdict
Looks like there are no open source pure JS libraries that satisfy the requirements of the task. There may be some in the glorious future though.

I also found PDFNetJS. It looks promising and allows us to:

merge PDFs
remove pages
edit PDFs by adding text/content, changing existing text, etc.

But:

it's commercial and requires a license
it's pretty slow
it targets browsers; it should be possible to use it in Node, but that requires testing
it's not pure JS; it looks like it's compiled via LLVM (so the source might be C or anything else)
the library is a single 24MB minified/obfuscated JS file, so a security review of it isn't possible

As @bmansurov says, there are currently no good Node.js libraries for manipulating PDF files easily. The existing libraries only allow us to create a PDF file from scratch. Libraries that allow PDF editing provide only functions to inject/edit text nodes and draw lines/other shapes, not a comfortable HTML -> PDF conversion.

Also, it's important to mention that many of these libraries are maintained by a single person; for example, most of them have many open issues on GitHub.

pmiazga moved this task from Needs Code Review to Ready for Signoff on the Readers-Web-Kanbanana-Board-Old board. Oct 3 2017, 8:04 PM

• bmansurov removed • bmansurov as the assignee of this task. Oct 3 2017, 8:11 PM

MBinder_WMF assigned this task to Jdlrobson. Oct 4 2017, 5:17 PM

MBinder_WMF added a project: Electron-PDFs. Oct 9 2017, 7:25 PM

MBinder_WMF moved this task from 2017-18 Q1 to 2017-18 Q2 on the Web-Team-Backlog board. Oct 10 2017, 7:00 PM

Thus we'd like to investigate Node.js libraries that allow us to manipulate PDF's.

I'm a little confused, probably because I took a long time to read through this and our direction has changed a lot since this spike. Are we still looking to do concatenation in light of the struggles we've hit in T178095? Why are we looking to manipulate PDFs rather than manipulate HTML before generating the PDF? What is the alternative we are looking at in a different language? What can that do that Node.js libraries can't?

jsPDF has support for creating PDFs, but doesn't seem to support manipulating existing PDFs.

But couldn't we do the manipulation before generating the PDF?

Looks like there are no open source pure JS libraries that satisfy the requirements of the task. There may be some in the glorious future though.

From what I can see, the outline could be generated in Node.js, as could the table of contents. The domino library, for example, allows us to turn a document into a DOM and manipulate it (and thus produce a table of contents). We could even use the existing table of contents in a page to aid this process. For example, we could extract the #toc elements and combine them all, locating the hash fragment heading and updating the label.
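
A rough sketch of the combining step, using plain objects in place of parsed DOM nodes so that it runs without any dependency. The shape of `articles` (an `id`, a `title`, and a flat `toc` list) and the `mergeTocs` name are assumptions made for illustration; in practice the entries would come from each page's #toc element.

```javascript
// Hypothetical input: one object per article, each carrying the headings
// that were found in its #toc. Only a single heading level is handled here.
function mergeTocs(articles) {
  const merged = [];
  articles.forEach((article, i) => {
    const chapter = i + 1;
    // The article itself becomes a top-level entry in the book TOC.
    merged.push({ label: `${chapter}`, title: article.title, href: `#${article.id}` });
    article.toc.forEach((heading, j) => {
      merged.push({
        // Update the label so numbering is relative to the whole book.
        label: `${chapter}.${j + 1}`,
        title: heading.title,
        // Prefix the fragment with the article id so identical headings
        // (e.g. "References") in different articles do not collide.
        href: `#${article.id}-${heading.href.replace(/^#/, '')}`,
      });
    });
  });
  return merged;
}

const book = [
  { id: 'Cat', title: 'Cat', toc: [{ title: 'Anatomy', href: '#Anatomy' }, { title: 'References', href: '#References' }] },
  { id: 'Dog', title: 'Dog', toc: [{ title: 'References', href: '#References' }] },
];

const bookToc = mergeTocs(book);
console.log(bookToc.map((e) => `${e.label} ${e.title} -> ${e.href}`).join('\n'));
```

Prefixing the fragments only works if the heading ids in the concatenated HTML are rewritten the same way.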

Is it fairer to say the problems with using JS libraries to satisfy the requirements hinge on the adding of page numbers to PDF pages? Or have I completely misunderstood the goal of this task?

Might https://github.com/marcbachmann/node-html-pdf be worth considering? It seems to be the most popular on libraries.io.

I think it would be worth talking about this a little more and experimenting with creating a concatenator in Node.js if we want to go down that route, at least for my benefit, but I'm not sure how pressing this conversation is at the moment.

@Jdlrobson I don't think we're concerned with generating the table of contents itself. Rather, we want to be able to add page numbers to the table of contents, and do the other things mentioned in the A/C. These things don't seem to be easily done in Node.js. The link you pasted seems like a replacement for headless Chromium, not something that allows us to manipulate PDF files. And we need to manipulate PDFs rather than HTML because laying out pages in HTML and getting the page numbers of sections doesn't seem possible. We tried it in T168871#3439115, and in the end that solution didn't work out well. Some other things, such as adding an outline to the PDF, are also done once the PDF is ready, not before, with the tools we are using.

Got it, so it sounds like the issue is actually with accessing and displaying page numbers inside the created PDF.
So yeah, having no Node.js libraries around to do that makes sense.

Thanks for the context about attempting to do this in HTML; that's useful.

It sounds like we'll either need to reconsider this requirement or use something not built in JavaScript. So yes, the conclusion makes sense to me.

That said, Node.js code/libraries could be used for "Add an outline" and "Add metadata such as the author, title, etc."
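
As an illustration of the part of "Add an outline" that needs no PDF library at all, the sketch below turns a flat list of headings into a nested outline tree. The input shape (`title`, `level`, `page`) and the `buildOutline` name are assumptions; writing the resulting tree into a PDF would still need a library, which is outside this sketch.

```javascript
// Hypothetical input: headings in document order, with level 1 for article
// titles and level 2+ for sections. Page numbers are assumed to be known.
function buildOutline(headings) {
  const root = { title: null, level: 0, children: [] };
  // The stack holds the chain of currently open ancestors.
  const stack = [root];
  for (const h of headings) {
    const node = { title: h.title, level: h.level, page: h.page, children: [] };
    // Pop until the top of the stack is shallower than this heading;
    // that node is the parent. The root (level 0) is never popped.
    while (stack[stack.length - 1].level >= h.level) {
      stack.pop();
    }
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

const outline = buildOutline([
  { title: 'Cat', level: 1, page: 1 },
  { title: 'Anatomy', level: 2, page: 2 },
  { title: 'Behavior', level: 2, page: 4 },
  { title: 'Dog', level: 1, page: 7 },
]);
console.log(JSON.stringify(outline, null, 2));
```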

Jdlrobson closed this task as Resolved. Nov 1 2017, 6:54 PM
