Introduce TOC with page numbers during PDF post-processing
Closed, Resolved · Public

Authored By

	ovasileva
	Jun 26 2017, 2:41 PM

Description

Background

The ElectronPDF service does not satisfy all requirements for the books feature, namely that books must be able to have a TOC that contains page numbers.

Acceptance criteria

Investigate how to process an Electron-created concatenated PDF to include the following:

Article concatenation
Page numbers
TOC with page numbers
Title page

note: the pages for the toc do not need to be numbered, page #1 will be the first page of the first article
note: complete requirements for the books feature may be found at the PDF functionality page: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Current_Functionality_Requirements
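
The numbering rule in the first note can be written down as a small mapping from physical PDF pages to printed page numbers. This is only a sketch: `page_label` and `front_matter_pages` are hypothetical names, and it assumes the title page and ToC are the only unnumbered pages and that they come first.

```python
def page_label(physical_page, front_matter_pages):
    """Return the printed page number for a 1-based physical page.

    Front matter (title page and ToC) stays unnumbered, so None is
    returned for it; the first article page is labelled 1.
    """
    if physical_page <= front_matter_pages:
        return None
    return physical_page - front_matter_pages
```

With two front-matter pages, physical page 3 is labelled 1 and physical pages 1 and 2 get no label.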

Details

	Subject	Repo	Branch	Lines +/-
	[WIP] Add TOC and page numbers via PDF post-processing	mediawiki/extensions/Collection	master	+267 -25

Related Objects

Status	Assigned	Task
Resolved	• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Resolved	TheDJ	T150872 Replace OCG in collection extension with Electron
Invalid	None	T167210 [EPIC] Adding PDF TOC with PDF page numbers to electron
Invalid	None	T186740 [EPIC] It should be possible to print a book using the Proton service
Stalled	None	T174670 Remove banner from saved books
Invalid	None	T171832 Deploy new book renderer to all projects
Duplicate	None	T171833 Deploy new book renderer to all projects side by side with OCG
Declined	None	T173018 Add an option in Special:Book to download PDFs generated by ElectronPdfService
Invalid	None	T173015 Use PDF post-processing service to generate final PDF
Invalid	None	T173579 Expose PDF post-processing scripts as a stateless web service
Resolved	• bmansurov	T171965 [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection
Invalid	ovasileva	T171960 Create a library to post-process PDF and add page numbers and table of contents
Resolved	ovasileva	T168871 Introduce TOC with page numbers during PDF post-processing
Resolved	pmiazga	T171838 Build out article concatenation according to requirements for books
Resolved	phuedx	T171964 [Spike - 8 hrs] Where should article concatenation be implemented?
Resolved	ovasileva	T175856 Implement changes to article concatenation based on books requirements
Invalid	ovasileva	T177805 [Spike] How do we render contributors and images section of books accurately?
Resolved	phuedx	T177672 Collection tests do not run properly
Resolved	phuedx	T177801 Collection phpunit tests are failing for table of contents when run locally
Resolved	Jdlrobson	T177892 PDF table of contents styling font-size is inconsistent
Invalid	None	T177993 Article concatenation fails on large books
Resolved	ovasileva	T177994 Book generation fails for articles with '/' character in title
Invalid	None	T177996 Article concatenation not resilient to curl errors
Invalid	None	T182230 [Spike] Explore ways of creating a stateless web service in Python
Resolved	• dpatrick	T173014 Security review of pdfrw

Event Timeline

ovasileva created this task. · Jun 26 2017, 2:41 PM

Restricted Application added a subscriber: Aklapper. · Jun 26 2017, 2:41 PM

ovasileva triaged this task as High priority. · Jun 26 2017, 2:41 PM

ovasileva moved this task from Incoming to 2014-15 Q4 on the Web-Team-Backlog board.

ovasileva updated the task description.

ovasileva added a subscriber: • bmansurov. · Jun 26 2017, 5:24 PM

Tgr mentioned this in T166188: Architecture of new rendering backend for Extension:Collection. · Jun 28 2017, 2:20 PM

Article concatenation can be done in PHP, by generating a concatenated HTML before sending to Electron. (It probably has to be, whether we use Electron or not, as we need to do HTML transformations based on it, such as making sure that two articles with the same section name have non-identical section ids). Title page also can be done that way. https://gerrit.wikimedia.org/r/#/c/361453/ has some PHP code that does both. Here is a book rendered with that patch (which uses Electron): sample.

Page numbers and TOC numbers are the only problematic parts. I am still investigating those. (It's doable in theory but all the PHP libraries turned out to be pretty horrible. There are three free libraries with decent PDF generation abilities: TCPDF, FPDI and ZendPdf. The first two have very limited support for modifying existing PDF files; ZendPdf seems to have the ability but it's unmaintained and poorly documented. We could also do it in Node and hack it into the Electron service, or use a library in an arbitrary language and shell out to it; I haven't looked at those options yet.)

Jdlrobson moved this task from 2014-15 Q4 to Tracking on the Web-Team-Backlog board. · Jun 29 2017, 8:20 PM

Jdlrobson edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.

Jdlrobson moved this task from Untriaged to Move to Backlog on the Web-Team-Backlog (Tracking) board.

Tgr added a parent task: T167210: [EPIC] Adding PDF TOC with PDF page numbers to electron. · Jul 4 2017, 6:45 PM

Change 364137 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/Collection@master] [WIP] Add TOC and page numbers via PDF post-processing

https://gerrit.wikimedia.org/r/364137

gerritbot added a project: Patch-For-Review. · Jul 10 2017, 12:18 AM

Added a proof-of-concept patch that adds outline, page numbers and TOC numbers. It is shoddy (uses two different PDF libraries for no good reason) and mostly relies on ZendPdf which is undocumented, unfinished and unmaintained; but it shouldn't be too hard to adapt the logic to some other PDF processing library.

Sample

Tgr mentioned this in T150874: Collate wikimedia pages into a single html wikimedia page that can then be rendered into a single pdf. · Jul 10 2017, 11:15 AM

Tgr mentioned this in T167210: [EPIC] Adding PDF TOC with PDF page numbers to electron.

Tgr mentioned this in T150872: Replace OCG in collection extension with Electron. · Jul 10 2017, 11:19 AM

• Fjalapeno added a project: Product-Infrastructure-Team-Backlog-Deprecated (Kanban). · Jul 10 2017, 5:54 PM

• Fjalapeno moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.

Tgr moved this task from Doing to Sign off on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board. · Jul 12 2017, 5:08 PM

I'm looking at some Python libraries that can do this. To me this method seems the sanest in terms of maintainability and getting the best results: PDF generation will be handed off to headless Chrome or Electron, and this extra step will add a table of contents, page numbers, and a cover page.

Pulling this into the sprint. As the Vivliostyle solution is currently blocked, we will be looking further into creating page numbers and TOC via post-processing.

• bmansurov claimed this task. · Jul 13 2017, 2:15 PM

ovasileva moved this task from To Do to Needs Design Review on the Readers-Web-Kanbanana-Board-Old board. · Jul 13 2017, 2:20 PM

• bmansurov moved this task from Needs Design Review to Doing on the Readers-Web-Kanbanana-Board-Old board. · Jul 13 2017, 2:53 PM

Jdlrobson moved this task from Incoming to 2016-17 Q4 on the Web-Team-Backlog board. · Jul 13 2017, 3:11 PM

ovasileva moved this task from 2016-17 Q4 to 2017-18 Q1 on the Web-Team-Backlog board. · Jul 13 2017, 5:40 PM

I think I'm onto something cool. Here's the output. Please enjoy it while I write up how I came up with it.

https://en.wikipedia.org/wiki/Berlin

watermark.Berlin.pdf (2 MB)

https://en.wikipedia.org/wiki/Trigonometric_functions

watermark.Trigonometric_functions.pdf (1 MB)

https://en.wikipedia.org/wiki/Climate_of_Australia

watermark.Climate_of_Australia.pdf (949 KB)

(with page numbers now)

@bmansurov - great job. It's a great week for good news! These look wonderful. A few notes (although I believe these are issues we already identified in Electron rather than the post-processing stuff):

article title missing
hyperlinks
trig identities: some things showing up in bold

@ovasileva

Yes, title can be added easily. I was experimenting with generating the ToC and page numbers and the title escaped my attention.
Links are already clickable, aren't they? They are just styled like normal text in print styles. (The items in ToC aren't yet clickable, but I think that's easily fixable.)
Not sure about trig identities, we'll have to investigate more. Note that I used the browser's print function to get the PDF. In Electron this may already be working.

I filed a bug some time ago about the SVG issue.

Thanks, @Tgr. Good to know.

Here's a proof of concept approach for getting the results in T168871#3437145.

In Chrome navigate to an article page using the RESTBase endpoint, e.g. https://en.wikipedia.org/api/rest_v1/page/html/Trigonometric_functions

In the developer settings emulate the 'print' media type.

Paste the following code into the console.

// A PoC code that adds the table of contents with page numbers to the beginning of an article.
// The approach here is to find headings and their supposed position after print assuming
// some constants, such as the page size, margins, and a rendered article with print styles
// applied.
(function () {
	var headings = document.querySelectorAll('h2,h3,h4,h5,h6');

	createStyles();
	createTableOfContents(headings);
	addPageNumbersToTableOfContents(
		headings, document.getElementById('table-of-contents'));

	function createStyles() {
		var style = document.createElement('style');

		style.type = 'text/css';
		style.innerHTML = '\
			body { width: 8.27in /* A4 */; margin: 0 auto; }\
			#table-of-contents { page-break-after: always; list-style: none; padding: 0; }\
			.page-number { float: right; }\
			.heading-text.h3 { padding-left: 10px; }\
			.heading-text.h4 { padding-left: 20px; }\
			.heading-text.h5 { padding-left: 30px; }\
			.heading-text.h6 { padding-left: 40px; }\
			#coordinates { display: none; }\
		';
		document.head.appendChild(style);
	}

	function createTableOfContents(headings) {
		var l = headings.length,
			tocHeading = document.createElement('h2'),
			toc = document.createElement('ul'),
			li, i;

		tocHeading.textContent = 'Table of Contents';
		toc.setAttribute('id', 'table-of-contents');

		// add headings and placeholders for page numbers
		for (i = 0; i < l; i++) {
			li = document.createElement('li');
			li.innerHTML = '<span class="heading-text ' + headings[i].tagName.toLowerCase() + '">' + headings[i].textContent + '</span><span class="page-number"></span>';
			toc.appendChild(li);
		}

		document.body.prepend(toc);
		document.body.prepend(tocHeading);
	}

	/**
	 * Some assumptions:
	 *  A4 page dimensions = 8.27 x 11.69 inches
	 *  Top margin = 0.4 inches
	 *  Bottom margin = 0.8 inches (leaving some space for page numbers at the bottom)
	 *  Page content height = 10.49 inches (page height - top margin - bottom margin)
	 *  1 inch = 96px (DPI can easily be queried using `xdpyinfo | grep -B2 resolution` under Xorg, for example)
	 */
	function addPageNumbersToTableOfContents(headings, toc) {
		var pageHeight = 10.49 * 96,
			tocOffset = toc.offsetHeight + toc.offsetTop,
			// there's always a page break after the ToC (done in CSS above)
			pageOffset = Math.ceil( tocOffset / pageHeight),
			l = headings.length,
			i, offset, page;

		// add page numbers to headings
		for (i = 0; i < l; i++) {
			offset = headings[i].offsetTop;
			page = pageOffset + Math.ceil( (offset - tocOffset) / pageHeight );
			toc.children[i].children[1].textContent = page;
		}
	}
} ());

Use the browser's print to PDF functionality to save the PDF as Trigonometric_functions.pdf.

Add watermark with a third party library. I used pdfrw. See the next steps on how to do so.

Here's a slightly modified version of pdfrw's sample code that adds watermark to a PDF. Save it as watermark.py.

import sys
import os

from pdfrw import PdfReader, PdfWriter, PageMerge

argv = sys.argv[1:]
underneath = '-u' in argv
if underneath:
    del argv[argv.index('-u')]
inpfn, wmarkfn = argv
outfn = 'watermark.' + os.path.basename(inpfn)
trailer = PdfReader(inpfn)
# add page numbers to footer
watermark = PdfReader(wmarkfn)
for i, page in enumerate(trailer.pages):
    wmark = PageMerge().add(watermark.pages[i])[0]
    PageMerge(page).add(wmark, prepend=underneath).render()
PdfWriter().write(outfn, trailer=trailer)

The watermark is just a regular PDF with the same dimensions and number of pages as the PDF printed from the browser. Here is the watermark PDF that I used:

Page-number.pdf (20 KB)

Save this file to the same location as the PDF you printed from the browser.

Run the above script like so:

python watermark.py 'Trigonometric_functions.pdf' 'Page-number.pdf'

You'll get a new PDF titled watermark.Trigonometric_functions.pdf. That's the final PDF.
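
For reference, the page-number arithmetic from the console snippet can be restated outside the browser. The following is a minimal Python sketch of the same estimate, not part of the PoC itself: the name `estimate_heading_pages` and the sample numbers are made up for illustration, and the page content height is passed in as a parameter instead of being hard-coded.

```python
import math

PX_PER_INCH = 96  # pixel density assumed by the PoC


def estimate_heading_pages(heading_offsets_px, toc_bottom_px, page_height_in):
    """Estimate the printed page of each heading from its pixel offset.

    Mirrors addPageNumbersToTableOfContents(): the ToC ends with a forced
    page break, so article content starts on the page after the ToC.
    """
    page_height_px = page_height_in * PX_PER_INCH
    # number of pages taken up by the ToC itself
    toc_pages = math.ceil(toc_bottom_px / page_height_px)
    return [
        toc_pages + math.ceil((offset - toc_bottom_px) / page_height_px)
        for offset in heading_offsets_px
    ]
```

For a 10-inch content height (960px), a ToC ending at 500px and headings at 600px, 1500px and 2500px, the estimate is pages 2, 3 and 4. Like the JS version, it knows nothing about content the browser moves around when printing.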

@Tgr what do you think about the approach? Do you think it will scale? Do you see problems with it? Thanks.

Here are some more generated articles from the list in T168004:

watermark.圣地亚哥_(智利).pdf (1 MB)

watermark.Santiago.pdf (3 MB)

watermark.سانتياغو.pdf (672 KB)

watermark.Сантьяго.pdf (882 KB)

I noticed that some of the page numbers in ToC are wrong and I assume we can fix that by improving the above code in T168871#3439115.

In T168871#3439115, @bmansurov wrote:

@Tgr what do you think about the approach? Do you think it will scale? Do you see problems with it? Thanks.

It's very simple (and consequently more robust than the alternatives) but slightly imprecise since the browser moves things to avoid text being cut in half, which affects the position of everything else after that. You can see it e.g. on the Nightlife and festivals section in Berlin which was close to the edge of the page; it would probably happen more often in longer books.
pdfrw seems to expose the internal dictionaries, which is probably all you need to implement the logic in https://gerrit.wikimedia.org/r/#/c/364137/2/includes/PdfPostProcessor.php@33. That would mean only getting page numbers after PDF rendering though, so you'd need to make two passes or add the numbers in the post-processing step.
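
To make the imprecision concrete, here is a toy Python model of it. It is a deliberately crude stand-in for what the browser does, with hypothetical function names: blocks that do not fit in the space left on a page are moved whole to the next page, and every block is assumed to be no taller than a page.

```python
def naive_page(offset_px, page_height_px):
    """Page (1-based) estimated purely from the on-screen offset."""
    return int(offset_px // page_height_px) + 1


def paginate(block_heights_px, page_height_px):
    """Return the 1-based page on which each block starts when a block
    that does not fit on the current page is pushed to the next one."""
    pages = []
    page, used = 1, 0
    for height in block_heights_px:
        if used + height > page_height_px:
            # not enough room left: start a new page
            page, used = page + 1, 0
        pages.append(page)
        used += height
    return pages
```

With a 1000px page and blocks of 400, 500, 300 and 900px (screen offsets 0, 400, 900 and 1200), the naive estimate is pages 1, 1, 1, 2, while the block-moving layout gives 1, 1, 2, 3. The third block is pushed to page 2, and that shift carries over to everything after it.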

Thanks, @Tgr. I agree that the solution doesn't take into account the page breaks and other CSS rules that may affect the presentation after printing.

pdfrw does expose some meta info, but the contents were compressed and the primary author of the tool recommended using another tool to analyze the contents of the PDF.

Search for a better solution continues.

Compression only affects content streams and the data relevant for numbering is not stored as content. The name of a destination (the section id) is a dictionary key (thus, a named string); the target of the destination is a reference to a page object. Annotations are a bit more complex but still parseable.

The PDFs you uploaded do not have internal links: the links point to en.wikipedia.org. Not sure what makes the difference but on my installation both Electron and the Chrome print dialog handled internal links properly, which means there is an annotation for them, which is pretty easy to extract from the PDF file. For example, processing for the file from T168871#3386930 would look something like this:

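from pdfrw import PdfReader

# `file` (path to the PDF) and `sections` (list of section ids) are assumed
# to be defined by the caller.
pdf = PdfReader(file)
for sectionId in sections:
    # named destinations live in the Dests dictionary of the document catalog
    dest = pdf.Root.Dests.__getattr__(sectionId)
    if dest:
        # the first element of a destination array is the target page object
        page = dest[0]
        pageNum = pdf.pages.index(page) + 1
        print('section %s is on page %d' % (sectionId, pageNum))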

If you want to get an idea of the internal structure of the PDF file, it's worth looking at it with something that can visualize it as a tree. I used iText RUPS (which is pretty broken but was good enough to find my way so I didn't look further; there are probably better tools).

That makes sense. My PDFs didn't have an outline, so I was trying to parse the contents of the PDF. I've thought about this and I think I'll attempt the following approach tomorrow.

1. Generate an HTML with a table of contents that links to sections (the links will turn into annotations when the PDF is generated).
2. Print the HTML to PDF.
3. Find the pages of the headings, generate an HTML of the table of contents and convert it to PDF.
4. Remove the pages that contain the original ToC from the PDF. Add the pages from the new ToC PDF generated in step 3.
5. Add a watermark with page numbers.

The advantage of this approach, as opposed to generating the ToC in PDF, is that Electron will take care of the layout of the page and we won't have to manually calculate the locations of boxes. I suppose this will also be useful for RTL languages or languages that use scripts other than Latin.

Btw, thanks for the tip on iText RUPS -- it's super useful.

Here is the Berlin article using the approach in T168871#3445849:

watermark.Berlin_new_toc.pdf (2 MB)

Here are some other articles:

Trigonometric_functions_final.pdf (1 MB)

圣地亚哥_(智利).pdf (1 MB)

سانتياغو.pdf (711 KB)

Howto

0. You're going to need the pdfrw library. Install it. Also, save this watermark file in your working directory (it will be used for adding page numbers to pages):

Page-number.pdf (20 KB)

1. Visit https://en.wikipedia.org/api/rest_v1/page/html/Berlin and emulate print styles.
2. Paste the following code into the console.

(function () {
	var headings = document.querySelectorAll('h2,h3,h4,h5,h6');

	createStyles();
	createTableOfContents(headings);

	function createStyles() {
		var style = document.createElement('style');

		style.type = 'text/css';
		style.innerHTML = '\
			body { width: 8.27in /* A4 */; margin: 0 auto; }\
			#table-of-contents { page-break-after: always; list-style: none; padding: 0; }\
			.heading-text.h3 { padding-left: 10px !important; }\
			.heading-text.h4 { padding-left: 20px !important; }\
			.heading-text.h5 { padding-left: 30px !important; }\
			.heading-text.h6 { padding-left: 40px !important; }\
			#coordinates { display: none; }\
		';
		document.head.appendChild(style);
	}

	function createTableOfContents(headings) {
		var l = headings.length,
			tocHeading = document.createElement('h2'),
			toc = document.createElement('ul'),
			li, i, heading, newId;

		tocHeading.textContent = 'Table of Contents';
		toc.setAttribute('id', 'table-of-contents');

		// add headings to ToC, encode heading level into the link and the heading itself
		for (i = 0; i < l; i++) {
			heading = headings[i];
			newId = heading.getAttribute('id') + '--' + encodeURIComponent(heading.textContent) + '--' + heading.tagName.toLowerCase();
			li = document.createElement('li');
			li.innerHTML = '<a href="#' + newId + '" class="heading-text ' + heading.tagName.toLowerCase() + '">' + heading.textContent + '</a>';
			toc.appendChild(li);
			heading.setAttribute('id', newId);
		}

		document.body.prepend(toc);
		document.body.prepend(tocHeading);
	}
} ());

3. Print the page to PDF using the browser and save the file as 'Berlin.pdf'.
4. Run the following script:

from operator import itemgetter
from urllib.parse import unquote

from pdfrw import PdfReader, PdfWriter, PageMerge

def extractTocFromPdf(pdf):
    """ Extracts the table of contents from PDF annotations.
        Heading IDs have been altered with JS (before printing) to
        include information about the heading title and the level.
    """
    endings = ('h2', 'h3', 'h4', 'h5', 'h6',)
    dests = []
    for dest_name, dest in pdf.Root.Dests.items():
        dest_splits = dest_name.rsplit('--', 2)
        if len(dest_splits) == 3 and dest_splits[2] in endings:
            # (anchor, heading, level, 1-based page number)
            dests.append((
                '#' + dest_name[1:],
                unquote(dest_splits[1]),
                dest_splits[2],
                pdf.pages.index(dest[0]) + 1
            ))

    # sort by page number, then by heading level
    dests.sort(key=itemgetter(3, 2))
    return dests

def generateTocHtml(toc, file_name):
    """ Generates the table of contents and saves it as HTML.
        TODO: move CSS rules to print styles, possibly as a separate
              RL module and load that module instead of the modules below.
        TODO: use a template to generate the HTML
        TODO: make HTML and CSS RTL-aware
    """
    html = '<html><head><link rel="stylesheet" href="https://en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Csite.styles%7Cext.cite.style%7Cmediawiki.page.gallery.styles&amp;only=styles&amp;skin=vector"/>'
    html += """<style>body { width: 8.27in /* A4 */; margin: 0 auto; }
		#table-of-contents { page-break-after: always; list-style: none; padding: 0; }
		#table-of-contents a { display: block; }
		.heading-text.h3 { padding-left: 10px !important; }
		.heading-text.h4 { padding-left: 20px !important; }
		.heading-text.h5 { padding-left: 30px !important; }
		.heading-text.h6 { padding-left: 40px !important; }
		.page-number { float: right; }
		#coordinates { display: none; }</style>"""
    html += '</head><body>'
    html += '<h2>Table of Contents</h2>'
    html += '<ul id="table-of-contents">'

    for item in toc:
        html += '<li><a href="%s"><span class="heading-text %s">%s</span><span class="page-number">%s</span></a>' %\
                              (item[0], item[2], item[1], item[3])

    html += '</ul>'
    html += '</body></html>'
    with open(file_name, 'w') as fout:
        fout.write(html)


def replaceToc(pdf, new_toc, file_name):
    """ Replace the old table of contents (without page numbers) with
        the one with page numbers
    """
    outdata = PdfWriter()
    outdata.addpages(new_toc.pages)
    n_toc_pages = len(new_toc.pages)
    outdata.addpages(pdf.pages[n_toc_pages:])
    outdata.write(file_name)

if __name__ == '__main__':
    pdf = PdfReader('Berlin.pdf')
    toc = extractTocFromPdf(pdf)
    generateTocHtml(toc, 'Berlin_toc.html')
    # TODO: Open 'Berlin_toc.html' in the browser and print to PDF and save it as 'Berlin_toc.pdf'
    print("Open 'Berlin_toc.html' in the browser and print to PDF and save it as 'Berlin_toc.pdf'. Enter 1 when you're done.")
    if str(input()).strip() == '1':
        new_toc = PdfReader('Berlin_toc.pdf')
        replaceToc(pdf, new_toc, 'Berlin_new_toc.pdf')
        infn = PdfReader('Berlin_new_toc.pdf')
        outfn = 'Berlin_final.pdf'
        watermark = PdfReader('Page-number.pdf')
        for i, page in enumerate(infn.pages):
            wmark = PageMerge().add(watermark.pages[i])[0]
            PageMerge(page).add(wmark, prepend=True).render()
        PdfWriter().write(outfn, trailer=infn)
        print('Ready. See Berlin_final.pdf')
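
One thing to watch in the script above: it sorts entries by page number and then by heading level, so on a page with several headings all h2 entries end up before any h3 entry. A possible alternative is sketched below. It assumes each destination is an explicit `/XYZ` destination (`[page /XYZ left top zoom]`) whose `top` value has already been read out as a number; the extra tuple field and the function name are hypothetical.

```python
def sort_toc_entries(entries):
    """Sort ToC entries into reading order.

    Each entry is (anchor, title, level, page, top), where `top` is the
    vertical coordinate of the destination in PDF user space. PDF
    coordinates grow upwards, so a larger `top` is higher on the page.
    """
    return sorted(entries, key=lambda entry: (entry[3], -entry[4]))
```

For a page holding an h2 at top=700, an h3 at top=500 and another h2 at top=300, this keeps the h3 between the two h2 entries, which sorting by level would not.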

Notes

The above approach is a proof of concept. It shows that what we're trying to do is achievable. Since we printed a RESTBase endpoint first and then manipulated the output PDF, we needed to encode some data about headings into PDF annotations. This step won't be necessary if we pass in this information using Extension:Collection.
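
The id encoding used to carry heading data through the PDF can be illustrated as a round trip in Python. This is a sketch with hypothetical helper names: it follows the `id--encoded title--level` pattern from the console snippet, with `urllib.parse.quote` standing in for `encodeURIComponent`.

```python
from urllib.parse import quote, unquote

LEVELS = ('h2', 'h3', 'h4', 'h5', 'h6')


def encode_heading_id(heading_id, title, level):
    """Pack the heading title and level into the element id."""
    return '--'.join((heading_id, quote(title, safe=''), level))


def decode_heading_id(encoded):
    """Return (heading_id, title, level), or None if `encoded` does not
    look like an id produced by encode_heading_id()."""
    parts = encoded.rsplit('--', 2)
    if len(parts) != 3 or parts[2] not in LEVELS:
        return None
    return parts[0], unquote(parts[1]), parts[2]
```

One limitation of the scheme: hyphens are not percent-encoded, so a heading title that itself contains `--` would be split in the wrong place when decoding.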

@Tgr does the approach in T168871#3449625 sound sane to you? Is this something we can improve on and push to production? Any concerns?

I think this is a plausible way to do it and I don't see anything wrong with the logic. The new ToC mustn't take up a different number of pages than the replaced one, but that doesn't seem hard to ensure. (Re: generating the ToC and passing information, https://gerrit.wikimedia.org/r/#/c/364137 contains some code for that that's IMO production-ready.)
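
The page-count constraint could be enforced with an explicit check before swapping the ToC. The sketch below uses plain Python lists in place of PDF page objects; `replace_toc` and `old_toc_page_count` are hypothetical, and how the old ToC length is determined is left open.

```python
def replace_toc(pages, old_toc_page_count, new_toc_pages):
    """Return a new page list with the leading ToC pages swapped out.

    Raises ValueError if the new ToC has a different page count, because
    every page number printed in it would then be off.
    """
    if len(new_toc_pages) != old_toc_page_count:
        raise ValueError(
            'new ToC has %d pages, expected %d'
            % (len(new_toc_pages), old_toc_page_count))
    return list(new_toc_pages) + list(pages[old_toc_page_count:])
```

The `replaceToc()` function in the PoC drops as many pages as the new ToC has, so it implicitly relies on the two page counts being equal.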

I'll defer to Ops/Services/Security on whether it is better or worse than wkhtmltopdf. I would prefer PDF post-processing but neither approach is great. (Of course if we had a great option, we wouldn't be still debating this...) The PHP process has to keep running and keep the connection open while Electron is working (plus while the python script is working, although I'd imagine that to take much less time), but that doesn't seem tragic.

Product-wise this is better than wkhtmltopdf as it can do the same things (except maybe outlines, although those are doable too as long as the PDF library supports adding to the structured data in the file) and we aren't rendering with a two-year-old browser with outdated CSS support. Vivliostyle has some nice extras (a web view, support for all kinds of print-oriented styling features like footnotes or page floats) but is less stable - it might be complementary but probably shouldn't be our only option.

@GWicke , @faidon - would you mind reviewing the approach in T168871#3449625?

ovasileva mentioned this in T159922: pdfrender fails to serve requests since Mar 8 00:30:32 UTC on scb1003. · Jul 19 2017, 10:28 AM

• bmansurov moved this task from Doing to Blocked on Others on the Readers-Web-Kanbanana-Board-Old board. · Jul 19 2017, 1:14 PM

In general, the combination of HTML concatenation / pre-processing, basic browser-based rendering, and PDF post-processing makes sense to me. The important thing from my perspective is that we'll be able to use a well-maintained PDF renderer, which is the most complex part of the system. I also agree with @Tgr's assessment of Vivliostyle. It would be more elegant in principle, but might make more sense to tackle in a second step.

For an actual production deploy, we'll need to think about how we structure this to ensure decent security and maintainability. We should also talk about the longer term ownership of the pieces.

@GWicke - in terms of ownership, would it make sense for the web team to take responsibility for styles and post-processing, while general Electron maintenance would still be in the hands of the Services team?

I don't have a strong preference for either. I think the post-processing approach makes sense overall and without looking at it very closely, it seems to me like Electron (and headless Chrome) would be better bets compared to wkhtmltopdf with regards to maintainability, compatibility, security etc.

My only concern would be around how this would look internally in terms of services/microservices. For example, I remember reading before about parsing RESTBase HTML in MediaWiki with Remex etc., which sounded like it would create even more service loops in our architecture (something I'm never too fond of). These are implementation details though, which I'm sure we can figure out as we move forward, and not concerns around the overall high-level design.

ovasileva mentioned this in T169738: [Spike 8hrs] Investigate ability of using post-processing approach with new print styles. · Jul 21 2017, 10:42 AM

• bmansurov removed • bmansurov as the assignee of this task. · Jul 21 2017, 8:46 PM

• bmansurov moved this task from Blocked on Others to Ready for Signoff on the Readers-Web-Kanbanana-Board-Old board.

Seems like this one is ready to go and we can commit to this solution. I'll set up the follow-up tasks. Thanks @bmansurov!

ovasileva added a subscriber: • dpatrick. · Jul 27 2017, 10:07 AM

• bmansurov mentioned this in T171960: Create a library to post-process PDF and add page numbers and table of contents. · Jul 28 2017, 3:18 PM

• bmansurov mentioned this in T171964: [Spike - 8 hrs] Where should article concatenation be implemented? · Aug 17 2017, 2:39 PM

• bmansurov added a parent task: T171960: Create a library to post-process PDF and add page numbers and table of contents. · Aug 18 2017, 4:32 PM

Change 364137 abandoned by Gergő Tisza:
[WIP] Add TOC and page numbers via PDF post-processing

Reason:
There is no decent PHP library for PDF metadata manipulation (Zend_PDF, which was used here, is unmaintained). Will be shelled out to Python instead.