
Create a library to post-process PDF and add page numbers and table of contents
Closed, Invalid · Public · 8 Estimate Story Points

Description

We want to create a module that takes a PDF and adds page numbers and the table of contents to it.

Proof-of-concept scripts (in both PHP and Python) have been created in T168871. We will use Python because of the availability of a good third-party PDF processing library called pdfrw.

A/C

  • Script takes a PDF and data for table of contents.
  • Script adds page numbers to the PDF.
  • Script creates the table of contents with page numbers and adds it to the PDF.
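
The acceptance criteria above can be pictured with a small, stdlib-only sketch. This is not the code from the ppg repository: the names `TocEntry`, `number_toc` and `page_labels` are hypothetical, and it assumes the table of contents is inserted in front of the body pages, so every body page number shifts by the number of TOC pages.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TocEntry:
    title: str
    body_page: int  # 1-based page within the book body, before the TOC is added


def number_toc(entries: List[TocEntry], toc_pages: int) -> List[str]:
    """Return TOC lines whose page numbers account for the inserted TOC pages.

    Assumption: the TOC is prepended, so every body page shifts by toc_pages.
    """
    if toc_pages < 0:
        raise ValueError("toc_pages must be non-negative")
    lines = []
    for entry in entries:
        final_page = entry.body_page + toc_pages
        lines.append(f"{entry.title} ... {final_page}")
    return lines


def page_labels(total_pages: int) -> List[str]:
    """Labels to stamp on each page of the final PDF, in page order."""
    return [str(n) for n in range(1, total_pages + 1)]
```

Actually stamping the labels onto pages and merging the TOC into the book would be done with a PDF library such as pdfrw; that part is left out here.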

WIP Post processor script:
https://github.com/kodchi/ppg

Related

T183104

Related Objects

Status      Assigned
Resolved    JKatzWMF
Invalid     None
Stalled     None
Invalid     None
Duplicate   None
Declined    None
Invalid     None
Invalid     None
Resolved    bmansurov
Invalid     ovasileva
Resolved    ovasileva
Resolved    Jdlrobson
Resolved    pmiazga
Resolved    phuedx
Resolved    ovasileva
Invalid     ovasileva
Resolved    phuedx
Resolved    phuedx
Resolved    Jdlrobson
Invalid     None
Resolved    ovasileva
Invalid     None
Invalid     None
Resolved    dpatrick

Event Timeline

ovasileva renamed this task from Add post-processing to PDF concatenation to Add post-processing to PDF concatenation for books. Jul 28 2017, 3:10 PM
ovasileva created this task.
bmansurov renamed this task from Add post-processing to PDF concatenation for books to Create a module to post-process PDF and add page numbers and table of contents. Jul 28 2017, 3:18 PM
bmansurov updated the task description.
phuedx added a subscriber: phuedx.

Proof-of-concept scripts (in both PHP and Python) have been created in T168871. We may need to use Python because of the availability of a good third-party PDF processing library.

T171960 states that we'll be using Python. Is this correct?

I was hoping so.

bmansurov updated the task description. Aug 10 2017, 2:34 PM
bmansurov updated the task description.
phuedx renamed this task from Create a module to post-process PDF and add page numbers and table of contents to Create a process to post-process PDF and add page numbers and table of contents. Aug 10 2017, 2:36 PM
phuedx renamed this task from Create a process to post-process PDF and add page numbers and table of contents to Create a library to post-process PDF and add page numbers and table of contents. Aug 10 2017, 2:42 PM
ovasileva set the point value for this task to 8. Aug 23 2017, 5:51 PM

I've started adding a post-processor script at https://github.com/kodchi/ppg.

TheDJ added a subscriber: TheDJ. Aug 29 2017, 2:06 PM
TheDJ added a comment. Edited Aug 29 2017, 2:28 PM

This is looking really awesome I must say, much more sustainable than patching up external libraries. It seems the underlying library can also be used to modify the metadata of the generated PDF.

I strongly suggest we make use of those features as well, so that we can have a structured outline of the document (a.k.a. bookmarks) and set the file's Title, Author, Subject, etc. fields. Adding metadata is a great way to make PDFs discoverable in search engines, and it is also handy for licensing and/or copyright violation detection. The outline/bookmarks metadata in particular has been requested by users multiple times.

This post shows how someone did this for DjVu conversions: https://bobbielf2.github.io/blog/2017/04/11/preserve-the-table-of-contents-when-converting-a-book-from-djvu-to-pdf/

These elements are not currently part of the acceptance criteria, so either we should amend those, or we could create separate tickets for those.

Tgr added a subscriber: Tgr. Aug 29 2017, 8:26 PM

Yeah, outline/metadata would be nice. It's fairly easy to do (8.2.2 of the PDF spec has the details, but basically just build a tree structure of the ToC), much easier than generating document content.
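
The "build a tree structure of the ToC" step mentioned above could look roughly like the following sketch. It only nests a flat, document-ordered list of headings; the function name `build_outline` and the `(level, title, page)` input shape are assumptions made for illustration, not part of pdfrw or of the post-processor script. In an actual PDF, each node would then become an outline item dictionary with entries such as /Title, /Parent, /First, /Last, /Next, /Prev and /Dest.

```python
from typing import Dict, List, Tuple

Heading = Tuple[int, str, int]  # (level starting at 1, title, 0-based page index)


def build_outline(headings: List[Heading]) -> List[Dict]:
    """Nest a flat, document-ordered heading list into an outline tree."""
    root: List[Dict] = []
    # Stack of (level, children list); the level-0 sentinel stands for the root.
    stack: List[Tuple[int, List[Dict]]] = [(0, root)]
    for level, title, page in headings:
        if level < 1:
            raise ValueError("heading levels start at 1")
        # Pop until the top of the stack is shallower than this heading;
        # that entry is the parent. The sentinel is never popped.
        while stack[-1][0] >= level:
            stack.pop()
        node = {"title": title, "page": page, "children": []}
        stack[-1][1].append(node)
        stack.append((level, node["children"]))
    return root
```

A heading that skips a level (for example an h3 directly under an h1) is simply attached to the nearest shallower heading, which seemed like the least surprising choice for bookmarks.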

Any plans on (eventually) making this into something that's generally useful for people post-processing Chrome PDF renderings (whether generated by Extension:Collection or not) and publishing it as a PyPI package? That would simplify installation and updates, as it could simply be managed as a system (or virtualenv) package via pip. Plus it would maximize third-party impact.

Yes, definitely, these are great suggestions and I think we should add these features over time. Creating a PyPI package sounds good too. I still need to figure out how to make the entries in the new table of contents clickable links that point to the corresponding headings in the rest of the PDF. Also, the approach to adding page numbers needs some more thinking. There are other improvements that we need to make, but in general, the current iteration can be used for our purposes with slight modifications.

Tgr added a comment. Aug 29 2017, 11:15 PM

It might be easier to keep the HTML TOC and only add in the numbers (mostly because that allows easy redesigning of the TOC via CSS), but if you want to do it by hand, you need to add a link annotation (see spec) with the exact same position as the text box. Which is hard if you generate the TOC from HTML in Python, since you probably don't have access to that position.

That's what I'm doing. Even after that, the links are gone when I insert the TOC PDF into the book PDF.
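
For reference, the link-annotation approach discussed above can be sketched as follows. This is a plain-Python illustration, not the script's code: `to_pdf_rect` and `link_annotation` are hypothetical helpers, and the sketch assumes the text box is already known in PDF points with a top-left origin (as HTML layout reports it), whereas default PDF user space has its origin at the lower left. It does not address the links being lost when the TOC PDF is merged into the book PDF.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (left, top, width, height), top-left origin


def to_pdf_rect(box: Box, page_height: float) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin box to a PDF /Rect.

    The result is (lower-left x, lower-left y, upper-right x, upper-right y).
    Assumption: box and page_height are already in PDF points (1/72 inch).
    """
    left, top, width, height = box
    lower_y = page_height - (top + height)
    upper_y = page_height - top
    return (left, lower_y, left + width, upper_y)


def link_annotation(box: Box, page_height: float, target_page: int) -> Dict[str, object]:
    """Describe a link annotation covering the text box of one TOC entry."""
    return {
        "/Type": "/Annot",
        "/Subtype": "/Link",
        "/Rect": list(to_pdf_rect(box, page_height)),
        "/Border": [0, 0, 0],  # no visible border around the link
        # Placeholder destination: a real PDF needs a reference to the
        # target page object here, not a plain page index.
        "/Dest": [target_page, "/Fit"],
    }
```

For a TOC line 100 pt from the top of a US Letter page (792 pt tall), `to_pdf_rect((72, 100, 200, 12), 792)` gives the rectangle from y = 680 to y = 692.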

TheDJ updated the task description. Aug 30 2017, 11:05 AM
ovasileva moved this task from Triage to Current Sprint on the Proton board. Aug 30 2017, 5:00 PM

I've added the ability to generate metadata and outlines. I still need to clean up some stuff, but the output is looking good.

I'm getting a little confused now. https://gerrit.wikimedia.org/r/#/c/361453 also adds a table of contents. Is it still up for discussion whether this happens in Python or PHP, or is the PHP table of contents output used by the Python script?

I can't really commit to reviewing, reading through and understanding the Python script right now; I have my work cut out understanding all the PHP code :/

Yes, we'll use the table of contents generated by PHP and add page numbers using Python by looking at the PDF generated by Electron. No worries about reviewing, I'm trying to give us a head start. I think our goal is to first get concatenation delivered ASAP.

ovasileva moved this task from Current Sprint to Backlog on the Proton board. Sep 6 2017, 6:57 PM
ovasileva moved this task from Backlog to Current Sprint on the Proton board. Sep 11 2017, 11:57 AM
MBinder_WMF removed bmansurov as the assignee of this task. Sep 13 2017, 5:10 PM
ovasileva changed the task status from Open to Stalled. Sep 22 2017, 12:48 PM

Moving out of sprint and marking as stalled until the completion of T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies

bmansurov changed the task status from Stalled to Open. Dec 6 2017, 6:45 PM

In T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies, we found out that no pure JS library can do the post-processing. So we'll keep improving the post-processing script written in Python.

Is this still in scope or is the focus right now to get PDFs generating with as little post-processing as possible?

Jdlrobson updated the task description. Jan 31 2018, 5:52 PM
ovasileva closed this task as Invalid. Apr 9 2018, 2:48 PM

Closing as per T184772#4116906. Pediapress will be taking on rendering of PDF books.