
[Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies
Closed, ResolvedPublic

Description

Background

Ideally, we'd like headless Chromium and post-processing to use a single service rather than having the post-processing steps have their own python service.

In T175853: [Spike 16hr] Investigate the ability of Python wrapped headless Chromium to render large books, we found that it's best to interact with headless Chromium using the official JS library. Thus we'd like to investigate Node.js libraries that allow us to manipulate PDFs. We also want to restrict the search to libraries that are written in JS only. There was pushback from ops and services when we wanted to use wkhtmltopdf to render PDFs, partly because it is written in C++. Some of the reasons given were: (a) it is hard to maintain, (b) it is a security risk, and (c) if something goes wrong, we don't have C++ developers handy to fix the issue in the underlying library.

A/C

Find a Node.js library that has the capability to do the following (in addition to being written in JS only and not offloading the work to external programs):

  • Add page numbers to PDF pages;
  • Add pages to a PDF;
  • Remove pages from a PDF;
  • Given the table of contents with links that point to headings in the PDF, find page numbers of headings in the PDF;
  • Add an outline;
  • Add metadata such as the author, title, etc.

Event Timeline

@bmansurov, @phuedx to update task description

bmansurov renamed this task from [Spike] Investigate libraries for post-processing with fewer dependencies to [Spike] Investigate libraries for post-processing without non-JS dependencies. Sep 22 2017, 2:14 PM
bmansurov updated the task description.
ovasileva renamed this task from [Spike] Investigate libraries for post-processing without non-JS dependencies to [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies. Sep 26 2017, 4:38 PM

I mainly considered libraries that have reached version 1.0 or later (not that there were great libraries below that version number). I didn't consider commercially licensed libraries such as jsreport.

There are many JS projects that deal with PDFs on github.com or npmjs.com, and here are the most relevant ones:

  • jsPDF has support for creating PDFs, but doesn't seem to support manipulating existing PDFs.
  • pdf.js doesn't support editing PDFs; it is only for reading them.
  • pdfkit is focused on creating documents from scratch, rather than on querying them or manipulating their pages.
  • pdfmake is based on pdfkit and has more or less the same features (I couldn't work out the differences, but it certainly doesn't have all the features that we need).

Verdict
Looks like there are no open source pure JS libraries that satisfy the requirements of the task. There may be some in the glorious future though.

I also found PDFNetJS. It looks promising, since it allows us to:

  • merge PDFs
  • remove pages
  • edit PDFs by adding text/content, changing existing text, etc.

But:

  • it's commercial and requires a license
  • it's pretty slow
  • it targets browsers; it should be possible to use it in Node, but that requires testing
  • it's not pure JS; it looks like it's compiled via LLVM (so the original source might be C or anything else)
  • the library is a single 24 MB minified/obfuscated JS file, so a security review of it is not possible

As @bmansurov says, there are currently no good Node.js libraries for manipulating PDF files easily. The existing libs mostly allow us to create a PDF file from scratch. Libraries that do allow PDF editing provide only functions to inject/edit text nodes or to draw lines and other shapes, not a comfy HTML -> PDF conversion.

Also, it's important to mention that many of the libraries are maintained by a single person (for example, most of them have many open issues on GitHub).

Thus we'd like to investigate Node.js libraries that allow us to manipulate PDFs.

I'm a little confused, which is probably because I took a long time to read through this and our direction has changed a lot since this spike. Are we still looking to do concatenation in light of the struggles we've hit in T178095? Why are we looking to manipulate PDFs rather than manipulate HTML before generating the PDF? What is the alternative we are looking at in a different language? What can that do that Node.js libraries can't?

jsPDF has support for creating PDFs, but doesn't seem to support manipulating existing PDFs.

But couldn't we do the manipulation before generating the PDF?

Looks like there are no open source pure JS libraries that satisfy the requirements of the task. There may be some in the glorious future though.

From what I can see, the outline could be generated in Node.js, as could the table of contents. The domino library, for example, allows us to turn a document into a DOM and manipulate it (and thus produce a table of contents). We could even use the existing table of contents in a page to aid this process, e.g. extract the #toc elements and combine them all, locating the heading for each hash fragment and updating the label.
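A minimal sketch of the combining step, assuming each article's #toc has already been extracted into plain { anchor, label } entries (for instance with a DOM library such as domino; that extraction is not shown). The function name, the field names, and the sample data are all hypothetical, and prefixing anchors with the article index is just one way to keep identical fragments from colliding.

```javascript
// Hypothetical per-article TOCs, as they might look after each page's
// #toc element has been extracted into plain objects (not shown here).
const articles = [
  { title: 'Earth', toc: [{ anchor: 'History', label: 'History' }] },
  {
    title: 'Moon',
    toc: [
      { anchor: 'History', label: 'History' },
      { anchor: 'Orbit', label: 'Orbit' },
    ],
  },
];

// Combine the TOCs into one flat list. Each anchor is prefixed with the
// article index so that '#History' in two articles stays distinguishable.
function combineTocs(items) {
  const combined = [];
  items.forEach((article, i) => {
    combined.push({ level: 1, label: article.title, anchor: `a${i}` });
    for (const entry of article.toc) {
      combined.push({
        level: 2,
        label: entry.label,
        anchor: `a${i}-${entry.anchor}`,
      });
    }
  });
  return combined;
}

const bookToc = combineTocs(articles);

// Print an indented view of the combined table of contents.
for (const entry of bookToc) {
  console.log(`${'  '.repeat(entry.level - 1)}${entry.label} (#${entry.anchor})`);
}
```

This only covers the part that can be done before rendering; it says nothing about which page each entry ends up on.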

Is it fairer to say the problems with using JS libraries to satisfy the requirements hinge on the adding of page numbers to PDF pages? Or have I completely misunderstood the goal of this task?

It might be worth considering https://github.com/marcbachmann/node-html-pdf ? It seems the most popular on libraries.io

I think it would be worth talking about this a little more and experimenting with creating a concatenator in Node.js if we want to go down that route, at least for my benefit, but I'm not sure how pressing this conversation is at the moment.

@Jdlrobson I don't think we're concerned with generating the table of contents itself. It's that we want to be able to add page numbers to the table of contents, plus the other things mentioned in the A/C. These things don't seem to be easily done in Node.js. The link you pasted seems like a replacement for headless Chromium, not something that allows us to manipulate PDF files. And we need to manipulate PDFs rather than HTML because laying out pages in HTML and getting the page numbers of sections doesn't seem possible. We tried it in T168871#3439115, and in the end that solution didn't work out well. Some other things, such as adding an outline to the PDF, are also done once the PDF is ready, not before, with the tools we are using.
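To make the page-number step concrete, here is a minimal sketch of the lookup it needs once the PDF exists. It assumes some tool has already extracted a map from heading anchor to page number out of the rendered PDF; that extraction is the part no pure-JS library seemed to cover, and it is not shown. The names addPageNumbers, destinations, and toc, and the sample data, are hypothetical.

```javascript
// Hypothetical data: heading anchors mapped to 1-based page numbers,
// as they would have to be read back from the rendered PDF.
const destinations = new Map([
  ['History', 3],
  ['Geography', 7],
]);

// Hypothetical TOC entries, each linking to a heading by hash fragment.
const toc = [
  { label: 'History', href: '#History' },
  { label: 'Geography', href: '#Geography' },
  { label: 'Climate', href: '#Climate' },
];

// Attach a page number to every TOC entry. Entries whose target is not
// among the known destinations get page: null instead of a guess.
function addPageNumbers(tocEntries, dests) {
  return tocEntries.map((entry) => {
    const name = decodeURIComponent(entry.href.replace(/^#/, ''));
    const page = dests.has(name) ? dests.get(name) : null;
    return { ...entry, page };
  });
}

const numberedToc = addPageNumbers(toc, destinations);
console.log(numberedToc);
```

The lookup itself is trivial; the hard part is that the destinations map only exists after layout, which is why this cannot be done on the HTML side.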

Got it, so it sounds like the issue is actually with accessing and displaying page numbers inside the created PDF.
So yeah, having no Node.js libraries around to do that makes sense.

Thanks for the context about attempting to do this in HTML, that's useful.

It sounds like we'll either need to reconsider this requirement or use something not built in JavaScript. So yes, the conclusion makes sense to me.

That said, Node code/libraries could be used for "Add an outline" and "Add metadata such as the author, title, etc."
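As a rough illustration of why the metadata part is tractable in plain JS: PDF metadata lives in a document information dictionary with keys such as /Title and /Author, and serializing one is simple string work. The sketch below only builds that dictionary text; writing it into an existing PDF file is not shown. The helper names pdfString and infoDictionary are hypothetical, and the sketch assumes ASCII-only values.

```javascript
// Serialize a value as a PDF literal string. Backslashes and parentheses
// must be escaped. This sketch deliberately rejects non-ASCII input,
// which would need a different string encoding.
function pdfString(text) {
  if (!/^[\x20-\x7e]*$/.test(text)) {
    throw new Error('this sketch only handles printable ASCII');
  }
  return '(' + text.replace(/[\\()]/g, (c) => '\\' + c) + ')';
}

// Build the text of an information dictionary from optional fields.
function infoDictionary(meta) {
  const parts = [];
  if (meta.title) parts.push(`/Title ${pdfString(meta.title)}`);
  if (meta.author) parts.push(`/Author ${pdfString(meta.author)}`);
  return `<< ${parts.join(' ')} >>`;
}

const info = infoDictionary({ title: 'Book (draft)', author: 'Jane Doe' });
console.log(info);
```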