Maniphest T171960

Create a library to post-process PDF and add page numbers and table of contents
Closed, Invalid | Public | 8 Estimated Story Points

Assigned To

Authored By

	ovasileva
	Jul 28 2017, 3:10 PM

Description

We want to create a module that takes a PDF and adds page numbers and the table of contents to it.

Proof-of-concept scripts (in both PHP and Python) were created in T168871. We will use Python because a good third-party PDF processing library, pdfrw, is available for it.

A/C

Script takes a PDF and data for table of contents.
Script adds page numbers to the PDF.
Script creates the table of contents with page numbers and adds it to the PDF.
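
As a rough illustration of the third criterion, the sketch below assumes the table-of-contents data arrives as (title, page) pairs, where page is the 1-based page in the book PDF before the table of contents is prepended, and that page numbering counts the prepended pages. The names TocEntry, shift_for_toc, and render_toc_lines are hypothetical and are not taken from the proof-of-concept scripts; reading and writing the PDF itself (for example with pdfrw) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TocEntry:
    title: str
    page: int  # 1-based page in the book PDF, before the TOC is added


def shift_for_toc(entries: List[TocEntry], toc_page_count: int) -> List[TocEntry]:
    """Renumber entries as if toc_page_count pages were prepended to the book."""
    return [TocEntry(e.title, e.page + toc_page_count) for e in entries]


def render_toc_lines(entries: List[TocEntry], width: int = 40) -> List[str]:
    """Format each entry as 'Title ..... 12', padded with dot leaders to width."""
    lines = []
    for e in entries:
        number = str(e.page)
        # Always keep at least two dots, even when the title is long.
        dots = "." * max(2, width - len(e.title) - len(number) - 2)
        lines.append(f"{e.title} {dots} {number}")
    return lines
```

If the table of contents itself spans more than one page, toc_page_count has to be known before the numbers are final, so a real script might need to render the table of contents once to measure it and then a second time with the shifted numbers.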

WIP Post processor script:
https://github.com/kodchi/ppg

T183104

Related Objects

Status	Assigned	Task
Resolved	• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Invalid	None	T186740 [EPIC] It should be possible to print a book using the Proton service
Stalled	None	T174670 Remove banner from saved books
Invalid	None	T171832 Deploy new book renderer to all projects
Duplicate	None	T171833 Deploy new book renderer to all projects side by side with OCG
Declined	None	T173018 Add an option in Special:Book to download PDFs generated by ElectronPdfService
Invalid	None	T173015 Use PDF post-processing service to generate final PDF
Invalid	None	T173579 Expose PDF post-processing scripts as a stateless web service
Resolved	• bmansurov	T171965 [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection
Invalid	ovasileva	T171960 Create a library to post-process PDF and add page numbers and table of contents
Resolved	ovasileva	T168871 Introduct toc with page numbers during pdf post-processing
Resolved	Jdlrobson	T176463 [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies
Resolved	pmiazga	T171838 Build out article concatenation according to requirements for books
Resolved	phuedx	T171964 [Spike - 8 hrs] Where should article concatenation be implemented?
Resolved	ovasileva	T175856 Implement changes to article concatenation based on books requirements
Invalid	ovasileva	T177805 [Spike] How do we render contributors and images section of books accurately?
Resolved	phuedx	T177672 Collection tests do not run properly
Resolved	phuedx	T177801 Collection phpunit tests are failing for table of contents when run locally
Resolved	Jdlrobson	T177892 PDF table of contents styling font-size is inconsistent
Invalid	None	T177993 Article concatenation fails on large books
Resolved	ovasileva	T177994 Book generation fails for articles with '/' character in title
Invalid	None	T177996 Article concatenation not resilient to curl errors
Invalid	None	T182230 [Spike] Explore ways of creating a stateless web service in Python
Resolved	• dpatrick	T173014 Security review of pdfrw

Event Timeline

ovasileva renamed this task from Add post-processing to PDF concatenation to Add post-processing to PDF concatenation for books. Jul 28 2017, 3:10 PM

ovasileva created this task.

• bmansurov renamed this task from Add post-processing to PDF concatenation for books to Create a module to post-process PDF and add page numbers and table of contents. Jul 28 2017, 3:18 PM

• bmansurov updated the task description.

• bmansurov mentioned this in T171965: [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection. Jul 28 2017, 3:55 PM

Jdlrobson edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog. Jul 31 2017, 5:30 PM

Jdlrobson moved this task from Untriaged to Move to Backlog on the Web-Team-Backlog (Tracking) board. Aug 1 2017, 8:54 PM

Proof-of-concept scripts (in both PHP and Python) have been created in T168871. We may need to use Python because of the availability of a good third-party PDF processing library.

T171960 states that we'll be using Python. Is this correct?

I was hoping so.

• bmansurov updated the task description. Aug 10 2017, 2:34 PM

• bmansurov updated the task description.

phuedx renamed this task from Create a module to post-process PDF and add page numbers and table of contents to Create a process to post-process PDF and add page numbers and table of contents. Aug 10 2017, 2:36 PM

phuedx renamed this task from Create a process to post-process PDF and add page numbers and table of contents to Create a library to post-process PDF and add page numbers and table of contents. Aug 10 2017, 2:42 PM

• bmansurov mentioned this in T173014: Security review of pdfrw. Aug 10 2017, 4:02 PM

• bmansurov created subtask T173014: Security review of pdfrw.

• bmansurov mentioned this in T171964: [Spike - 8 hrs] Where should article concatenation be implemented? Aug 17 2017, 2:39 PM

• bmansurov mentioned this in T173579: Expose PDF post-processing scripts as a stateless web service. Aug 18 2017, 4:19 PM

• bmansurov removed a parent task: T171838: Build out article concatenation according to requirements for books. Aug 18 2017, 4:31 PM

• bmansurov removed a subtask: T173014: Security review of pdfrw.

• bmansurov added a subtask: T168871: Introduct toc with page numbers during pdf post-processing.

ovasileva set the point value for this task to 8. Aug 23 2017, 5:51 PM

ovasileva added a project: Readers-Web-Kanbanana-Board-Old.

ovasileva moved this task from To Do to Needs Design Review on the Readers-Web-Kanbanana-Board-Old board. Aug 23 2017, 5:54 PM

• bmansurov claimed this task. Aug 24 2017, 1:54 PM

• bmansurov moved this task from Needs Design Review to Doing on the Readers-Web-Kanbanana-Board-Old board.

I've started adding a post-processor script at https://github.com/kodchi/ppg.

• bmansurov moved this task from Doing to Needs Code Review on the Readers-Web-Kanbanana-Board-Old board. Aug 28 2017, 10:13 PM

TheDJ subscribed. Aug 29 2017, 2:06 PM

This is looking really awesome I must say, much more sustainable than patching up external libraries. It seems the underlying library can also be used to modify the metadata of the generated PDF.

I strongly suggest we make use of those features as well, so that we can have a structured outline of the document (a.k.a. bookmarks) and set the file's Title, Author, Subject, etc. fields. Adding metadata is a great way to make PDFs discoverable by search engines, and it is also handy for licensing and copyright-violation detection. The outline/bookmarks metadata in particular has been requested by users multiple times.

This post shows how someone did this for DjVu conversions:
https://bobbielf2.github.io/blog/2017/04/11/preserve-the-table-of-contents-when-converting-a-book-from-djvu-to-pdf/

These elements are not currently part of the acceptance criteria, so we should either amend the criteria or create separate tickets for them.

Yeah, outline/metadata would be nice. It's fairly easy to do (8.2.2 of the PDF spec has the details, but basically just build a tree structure of the ToC), much easier than generating document content.
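
For illustration, the "tree structure of the ToC" can be sketched as below. This is a minimal sketch that assumes the headings are available as a flat list of (level, title, page) tuples with levels starting at 1; build_outline and that input shape are hypothetical, not part of the script under discussion. In an actual PDF outline each item is a dictionary with a /Title and /Parent, /First, /Last, /Next, and /Prev links; writing those objects with a PDF library is omitted here.

```python
def build_outline(headings):
    """Nest a flat list of (level, title, page) tuples into a tree.

    Level 1 is a top-level entry. A deeper entry becomes a child of the
    nearest preceding entry with a smaller level. Levels must be >= 1.
    """
    root = {"title": None, "level": 0, "children": []}
    stack = [root]  # path from the root to the most recently added node
    for level, title, page in headings:
        node = {"title": title, "level": level, "page": page, "children": []}
        # Pop back up until the top of the stack can be this node's parent.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]
```

Each node's children list maps naturally onto the /First ... /Last chain of an outline item, and the page value onto its destination.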

Are there any plans to (eventually) make this generally useful for people post-processing Chrome PDF renderings (whether generated by Extension:Collection or not) and to publish it as a PyPI package? That would simplify installation and updates, since it could be managed as a system (or virtualenv) package via pip. It would also maximize third-party impact.

Yes, definitely, these are great suggestions and I think we should add these features over time. Creating a PyPI package sounds good too. I still need to figure out how to make the entries in the new table of contents clickable links that point to the corresponding headings in the rest of the PDF. The approach to adding page numbers also needs more thought. There are other improvements that we need to make, but in general the current iteration can be used for our purposes with slight modifications.

It might be easier to keep the HTML TOC and only add in the numbers (mostly because that allows easy redesigning of the TOC via CSS). If you want to do it by hand, though, you need to add a link annotation (see spec) with the exact same position as the text box, which is hard if you generate the TOC from HTML in Python, since you probably don't have access to that position.
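
To make the position requirement concrete, here is a minimal sketch assuming the text box is known in HTML-style coordinates (origin at the top-left of the page, in points). PDF user space has its origin at the bottom-left by default, and a link annotation's /Rect is [llx, lly, urx, ury]. The helper names and the plain-dict representation are hypothetical; the destination (/Dest or /A) is omitted because it must reference the target page object, which depends on the PDF library used.

```python
def text_box_to_rect(x, y_top, width, height, page_height):
    """Convert a top-left-origin text box to a bottom-left-origin PDF /Rect."""
    lower_y = page_height - y_top - height
    upper_y = page_height - y_top
    return [x, lower_y, x + width, upper_y]


def link_annotation(rect):
    """Describe a link annotation covering rect, as a plain dict.

    The keys are the standard PDF names; the destination is left out on purpose.
    """
    return {
        "/Type": "/Annot",
        "/Subtype": "/Link",
        "/Rect": rect,
        "/Border": [0, 0, 0],  # no visible border around the link
    }
```

For example, a 200 x 12 pt text box placed 72 pt from the left and 100 pt from the top of a US Letter page (792 pt tall) would be passed as text_box_to_rect(72, 100, 200, 12, 792).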

That's what I'm doing. Even after that, the links are gone when I insert the TOC PDF into the book PDF.

TheDJ updated the task description. Aug 30 2017, 11:05 AM

ovasileva added a project: Proton. Aug 30 2017, 4:56 PM

ovasileva moved this task from Triage to Current Sprint on the Proton board. Aug 30 2017, 5:00 PM

I've added the ability to generate metadata and outlines. I still need to clean up some stuff, but the output is looking good.

I'm getting a little confused now. https://gerrit.wikimedia.org/r/#/c/361453 also adds a table of contents. Is it still up for discussion whether this happens in Python or PHP, or does the Python script use the table of contents output by PHP?

I can't really commit to reviewing, reading through, and understanding the Python script right now; I have my work cut out understanding all the PHP code :/

Yes, we'll use the table of contents generated by PHP and add page numbers using Python by looking at the PDF generated by Electron. No worries about reviewing, I'm trying to give us a head start. I think our goal is to first get concatenation delivered ASAP.

• MZMcBride subscribed. Sep 5 2017, 9:40 PM

ovasileva moved this task from Current Sprint to Backlog on the Proton board. Sep 6 2017, 6:57 PM

ovasileva moved this task from Backlog to Current Sprint on the Proton board. Sep 11 2017, 11:57 AM

MBinder_WMF removed • bmansurov as the assignee of this task. Sep 13 2017, 5:10 PM

ovasileva edited projects, added Web-Team-Backlog; removed Readers-Web-Kanbanana-Board-Old, Web-Team-Backlog (Tracking). Sep 20 2017, 5:09 PM

ovasileva moved this task from Incoming to Upcoming on the Web-Team-Backlog board.

ovasileva moved this task from Upcoming to Needs Prioritization on the Web-Team-Backlog board. Sep 22 2017, 12:46 PM

Moving out of sprint and marking as stalled until the completion of T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies

ovasileva added a subtask: T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies. Sep 22 2017, 12:48 PM

• bmansurov mentioned this in T178077: Security review of Beautiful Soup. Oct 12 2017, 3:15 PM

Jdlrobson closed subtask T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies as Resolved. Nov 1 2017, 6:54 PM

In T176463: [Spike 8hrs] Investigate libraries for post-processing without non-JS dependencies, we found that no pure JS library can do the post-processing, so we'll keep improving the post-processing script written in Python.

• bmansurov added a parent task: T173579: Expose PDF post-processing scripts as a stateless web service. Dec 6 2017, 6:46 PM

• bmansurov mentioned this in T182230: [Spike] Explore ways of creating a stateless web service in Python. Dec 6 2017, 7:08 PM

• bmansurov unsubscribed. Dec 22 2017, 9:47 PM

Is this still in scope, or is the focus right now on getting PDFs generated with as little post-processing as possible?

Jdlrobson updated the task description. Jan 31 2018, 5:52 PM

Jdlrobson added a parent task: T186740: [EPIC] It should be possible to print a book using the Proton service. Feb 7 2018, 7:32 PM

Jdlrobson moved this task from Product Owner Backlog to Tracking on the Web-Team-Backlog board.

Jdlrobson edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.

Closing as per T184772#4116906. Pediapress will be taking on rendering of PDF books.
