Introduce TOC with page numbers during PDF post-processing
Closed, Resolved (Public)

Description

Background

The ElectronPDF service does not satisfy all requirements for the books feature; in particular, it cannot give books a TOC that contains page numbers.

Acceptance criteria

Investigate how to process an Electron-created concatenated PDF to include the following:

  • Article concatenation
  • Page numbers
  • TOC with page numbers
  • Title page

note: the pages for the toc do not need to be numbered, page #1 will be the first page of the first article
note: complete requirements for the books feature may be found at the PDF functionality page: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Current_Functionality_Requirements
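
The numbering rule in the first note can be sketched as a mapping from physical page position to printed page number. This is only an illustration in Python 3; the function name `page_label` and the `front_matter_pages` parameter are hypothetical and do not come from any patch linked in this task.

```python
def page_label(physical_index, front_matter_pages):
    """Return the printed page number for a 0-based physical page index.

    Front matter (title page, TOC) stays unnumbered, so it maps to None.
    Page 1 is the first page of the first article.
    """
    if physical_index < front_matter_pages:
        return None
    return physical_index - front_matter_pages + 1


# Assumed layout: 1 title page + 2 TOC pages, followed by article pages.
labels = [page_label(i, 3) for i in range(6)]
print(labels)  # [None, None, None, 1, 2, 3]
```

With this rule the printed numbers do not depend on how many pages the TOC occupies, as long as the front-matter page count is known when the numbers are stamped.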

Event Timeline

ovasileva moved this task from Incoming to 2014-15 Q4 on the Web-Team-Backlog board.
ovasileva updated the task description.

Article concatenation can be done in PHP, by generating a concatenated HTML before sending to Electron. (It probably has to be, whether we use Electron or not, as we need to do HTML transformations based on it, such as making sure that two articles with the same section name have non-identical section ids). Title page also can be done that way. https://gerrit.wikimedia.org/r/#/c/361453/ has some PHP code that does both. Here is a book rendered with that patch (which uses Electron): sample.
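
The id-collision point can be illustrated with a small sketch. The comment above proposes doing this in PHP; the Python 3 version below is only meant to show the idea, and the helper names `prefix_ids` and `concatenate` as well as the `aN-` prefix scheme are assumptions. A regular expression is used for brevity; real article HTML would need a proper HTML parser.

```python
import re


def prefix_ids(html, prefix):
    """Prefix every id="..." value and every same-document href="#..." target."""
    # (?<![\w-]) avoids touching attributes such as data-id="...".
    html = re.sub(r'(?<![\w-])id="([^"]+)"',
                  lambda m: 'id="%s%s"' % (prefix, m.group(1)), html)
    html = re.sub(r'href="#([^"]+)"',
                  lambda m: 'href="#%s%s"' % (prefix, m.group(1)), html)
    return html


def concatenate(articles):
    """Join article body fragments in book order, giving each its own id namespace."""
    return '\n'.join(
        prefix_ids(body, 'a%d-' % number)
        for number, body in enumerate(articles, 1)
    )


book = concatenate([
    '<h2 id="History">History</h2><a href="#History">top</a>',
    '<h2 id="History">History</h2>',
])
print(book)
```

Both articles keep a "History" heading, but the ids become `a1-History` and `a2-History`, so links and PDF destinations stay unambiguous.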

Page numbers and TOC numbers are the only problematic parts. I am still investigating those. (It's doable in theory but all the PHP libraries turned out to be pretty horrible. There are three free libraries with decent PDF generation abilities: TCPDF, FPDI and ZendPdf. The first two have very limited support for modifying existing PDF files; ZendPdf seems to have the ability but it's unmaintained and poorly documented. We could also do it in Node and hack it into the Electron service, or use a library in an arbitrary language and shell out to it; I haven't looked at those options yet.)

Change 364137 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/Collection@master] [WIP] Add TOC and page numbers via PDF post-processing

https://gerrit.wikimedia.org/r/364137

Added a proof-of-concept patch that adds outline, page numbers and TOC numbers. It is shoddy (uses two different PDF libraries for no good reason) and mostly relies on ZendPdf which is undocumented, unfinished and unmaintained; but it shouldn't be too hard to adapt the logic to some other PDF processing library.

I'm looking at some Python libraries that can do this. To me this method seems the sanest in terms of maintainability and getting the best results: PDF generation is handed off to headless Chrome or Electron, and this extra step adds a table of contents, page numbers, and a cover page.

Pulling this into the sprint. As the Vivliostyle solution is currently blocked, we will be looking further into creating page numbers and TOC via post-processing.

I think I'm onto something cool. Here's the output. Please enjoy it while I write up how I came up with it.

https://en.wikipedia.org/wiki/Berlin

https://en.wikipedia.org/wiki/Trigonometric_functions

https://en.wikipedia.org/wiki/Climate_of_Australia

(with page numbers now)

@bmansurov - great job. it's a great week for good news! These look wonderful. A few notes (although I believe these are issues we already identified in electron rather than the post-processing stuff):

  • article title missing
  • hyperlinks
  • trig identities: some things showing up in bold

@ovasileva

  • Yes, title can be added easily. I was experimenting with generating the ToC and page numbers and the title escaped my attention.
  • Links are already clickable, aren't they? They are just styled like normal text in print styles. (The items in ToC aren't yet clickable, but I think that's easily fixable.)
  • Not sure about trig identities, we'll have to investigate more. Note that I used the browser's print function to get the PDF. In Electron this may already be working.

I filed a bug some time ago about the SVG issue.

Here's a proof of concept approach for getting the results in T168871#3437145.

  1. In Chrome navigate to an article page using the RESTBase endpoint, e.g. https://en.wikipedia.org/api/rest_v1/page/html/Trigonometric_functions
  2. In the developer settings emulate the 'print' media type.
  3. Paste the following code into the console.
// A PoC code that adds the table of contents with page numbers to the beginning of an article.
// The approach here is to find headings and their supposed position after print assuming
// some constants, such as the page size, margins, and a rendered article with print styles
// applied.
(function () {
	var headings = document.querySelectorAll('h2,h3,h4,h5,h6');

	createStyles();
	createTableOfContents(headings);
	addPageNumbersToTableOfContents(
		headings, document.getElementById('table-of-contents'));

	function createStyles() {
		var style = document.createElement('style');

		style.type = 'text/css';
		style.innerHTML = '\
			body { width: 8.27in /* A4 */; margin: 0 auto; }\
			#table-of-contents { page-break-after: always; list-style: none; padding: 0; }\
			.page-number { float: right; }\
			.heading-text.h3 { padding-left: 10px; }\
			.heading-text.h4 { padding-left: 20px; }\
			.heading-text.h5 { padding-left: 30px; }\
			.heading-text.h6 { padding-left: 40px; }\
			#coordinates { display: none; }\
		';
		document.head.appendChild(style);
	}

	function createTableOfContents(headings) {
		var l = headings.length,
			tocHeading = document.createElement('h2'),
			toc = document.createElement('ul'),
			li, i;

		tocHeading.textContent = 'Table of Contents';
		toc.setAttribute('id', 'table-of-contents');

		// add headings and placeholders for page numbers
		for (i = 0; i < l; i++) {
			li = document.createElement('li');
			li.innerHTML = '<span class="heading-text ' + headings[i].tagName.toLowerCase() + '">' + headings[i].textContent + '</span><span class="page-number"></span>';
			toc.appendChild(li);
		}

		document.body.prepend(toc);
		document.body.prepend(tocHeading);
	}

	/**
	 * Some assumptions:
	 *  A4 page dimensions = 8.27 x 11.69 inches
	 *  Top margin = 0.4 inches
	 *  Bottom margin = 0.8 inches (leaving some space for page numbers at the bottom)
	 *  Page content height = 11.57 (page height - top margin - bottom margin)
	 *  1 inch = 96px (DPI can easily be queried using `xdpyinfo | grep -B2 resolution` under Xorg, for example)
	 */
	function addPageNumbersToTableOfContents(headings, toc) {
		var pageHeight = 11.57 * 96,
			tocOffset = toc.offsetHeight + toc.offsetTop,
			// there's always a page break after the ToC (done in CSS above)
			pageOffset = Math.ceil( tocOffset / pageHeight),
			l = headings.length,
			i, offset, page;

		// add page numbers to headings
		for (i = 0; i < l; i++) {
			offset = headings[i].offsetTop;
			page = pageOffset + Math.ceil( (offset - tocOffset) / pageHeight );
			toc.children[i].children[1].textContent = page;
		}
	}
} ());
  4. Use the browser's print to PDF functionality to save the PDF as Trigonometric_functions.pdf.
  5. Add a watermark with a third-party library. I used pdfrw. See the next steps on how to do so.
  6. Here's a slightly modified version of pdfrw's sample code that adds a watermark to a PDF. Save it as watermark.py.
import sys
import os

from pdfrw import PdfReader, PdfWriter, PageMerge

argv = sys.argv[1:]
underneath = '-u' in argv
if underneath:
    del argv[argv.index('-u')]
inpfn, wmarkfn = argv
outfn = 'watermark.' + os.path.basename(inpfn)
trailer = PdfReader(inpfn)
# add page numbers to footer
watermark = PdfReader(wmarkfn)
for i, page in enumerate(trailer.pages):
    wmark = PageMerge().add(watermark.pages[i])[0]
    PageMerge(page).add(wmark, prepend=underneath).render()
PdfWriter().write(outfn, trailer=trailer)
  7. The watermark is just a regular PDF (with the same dimensions and number of pages as the PDF generated in step 4). Here is the watermark PDF that I used:

Save this file to the same location as the PDF you generated in step 4.

  8. Run the above script like so:
python watermark.py 'Trigonometric_functions.pdf' 'Page-number.pdf'

You'll get a new PDF titled watermark.Trigonometric_functions.pdf. That's the final PDF.

@Tgr what do you think about the approach? Do you think it will scale? Do you see problems with it? Thanks.

Here are some more generated articles from the list in T168004:

I noticed that some of the page numbers in ToC are wrong and I assume we can fix that by improving the above code in T168871#3439115.

It's very simple (and consequently more robust than the alternatives) but slightly imprecise since the browser moves things to avoid text being cut in half, which affects the position of everything else after that. You can see it e.g. on the Nightlife and festivals section in Berlin which was close to the edge of the page; it would probably happen more often in longer books.
pdfrw seems to expose the internal dictionaries, which is probably all you need to implement the logic in https://gerrit.wikimedia.org/r/#/c/364137/2/includes/PdfPostProcessor.php@33 . That would mean only getting page numbers after PDF rendering though, so you'd need to make two passes or add the numbers in the post-processing step.
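
The imprecision described here can be made concrete with a Python 3 port of the page arithmetic from the console snippet earlier in this thread. The constant 11.57 in and the 96 px/in figure are copied from that snippet; the sample pixel offsets and the 40 px push are made-up numbers chosen only to show the effect.

```python
import math

PX_PER_INCH = 96
# Page content height as used in the console snippet above.
PAGE_CONTENT_HEIGHT_PX = 11.57 * PX_PER_INCH


def estimated_page(heading_offset_px, toc_end_px,
                   page_height_px=PAGE_CONTENT_HEIGHT_PX):
    """Estimate the page of a heading from its vertical offset in the HTML.

    Mirrors the JavaScript: pages taken by the ToC, plus pages of article
    content above the heading (the ToC is always followed by a page break).
    """
    toc_pages = math.ceil(toc_end_px / page_height_px)
    return toc_pages + math.ceil((heading_offset_px - toc_end_px) / page_height_px)


# A heading that sits just above a page boundary...
print(estimated_page(1600, 500))  # 2
# ...lands on the next page once the browser pushes content down by 40px
# (hypothetical amount) to avoid cutting text in half.
print(estimated_page(1640, 500))  # 3
```

The estimate never sees the second case, because the push only happens at print time. Every heading after that point inherits the same drift, which is why longer books would be affected more.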

Thanks, @Tgr. I agree that the solution doesn't take into account the page breaks and other CSS rules that may affect the presentation after printing.

pdfrw does expose some meta info, but the contents were compressed and the primary author of the tool recommended using another tool to analyze the contents of the PDF.

The search for a better solution continues.

Compression only affects content streams and the data relevant for numbering is not stored as content. The name of a destination (the section id) is a dictionary key (thus, a named string); the target of the destination is a reference to a page object. Annotations are a bit more complex but still parseable.

The PDFs you uploaded do not have internal links: the links point to en.wikipedia.org. Not sure what makes the difference but on my installation both Electron and the Chrome print dialog handled internal links properly, which means there is an annotation for them, which is pretty easy to extract from the PDF file. For example, processing for the file from T168871#3386930 would look something like this:

from pdfrw import PdfReader

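# `file` is the path to the PDF; `sections` is the list of section ids to look up
pdf = PdfReader(file)
for sectionId in sections:
    dest = pdf.Root.Dests.__getattr__(sectionId)
    if dest:
        # the first element of a destination array is the target page object
        page = dest[0]
        pageNum = pdf.pages.index(page) + 1
        print('section %s is on page %d' % (sectionId, pageNum))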

If you want to get an idea of the internal structure of the PDF file, it's worth looking at it with something that can visualize it as a tree. I used iText RUPS (which is pretty broken but was good enough to find my way so I didn't look further; there are probably better tools).

That makes sense. My PDFs didn't have an outline, so I was trying to parse the contents of the PDF. I've thought about this and I think I'll attempt the following approach tomorrow.

  1. Generate an HTML with a table of contents that links to sections (the links will turn into annotations when the PDF is generated).
  2. Print HTML to PDF.
  3. Find pages of headings and generate an HTML of the table of contents and convert it to PDF.
  4. Remove the pages that contain the original ToC from the PDF. Add pages from the new ToC PDF generated in step 3.
  5. Add watermark with page numbers.
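
Step 4 of the plan above amounts to a splice on the page list. The sketch below (Python 3) models pages as plain list items so it runs without any PDF library; with pdfrw the same splice would go through `PdfWriter.addpages`. The function name `replace_toc` and the returned `shift` value are illustrative assumptions.

```python
def replace_toc(pages, new_toc_pages, old_toc_page_count):
    """Swap the first old_toc_page_count pages for new_toc_pages.

    Returns the new page list and the shift in physical page positions
    for everything after the ToC. A non-zero shift means page numbers
    computed before the swap no longer match the final document.
    """
    rebuilt = list(new_toc_pages) + list(pages[old_toc_page_count:])
    shift = len(new_toc_pages) - old_toc_page_count
    return rebuilt, shift


original = ['toc-a', 'toc-b', 'p1', 'p2', 'p3']
rebuilt, shift = replace_toc(original, ['TOC-A', 'TOC-B'], 2)
print(rebuilt)  # ['TOC-A', 'TOC-B', 'p1', 'p2', 'p3']
print(shift)    # 0
```

Returning the shift makes it easy to detect the case where the regenerated ToC is longer or shorter than the placeholder it replaces.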

The advantage of this approach as opposed to generating the ToC in PDF is that Electron will take care of the layout of the page and we won't have to manually calculate the locations of boxes. I suppose this will also be useful for RTL languages or languages that use scripts other than Latin.

Btw, thanks for the tip on iText RUPS -- it's super useful.

Here is the Berlin article using the approach in T168871#3445849:

Here are some other articles:

Howto

0. You're going to need the pdfrw library. Install it. Also, save this watermark file in your working directory (it will be used for adding page numbers to pages):

  1. Visit https://en.wikipedia.org/api/rest_v1/page/html/Berlin and emulate print styles.
  2. Paste the following code into the console.
(function () {
	var headings = document.querySelectorAll('h2,h3,h4,h5,h6');

	createStyles();
	createTableOfContents(headings);

	function createStyles() {
		var style = document.createElement('style');

		style.type = 'text/css';
		style.innerHTML = '\
			body { width: 8.27in /* A4 */; margin: 0 auto; }\
			#table-of-contents { page-break-after: always; list-style: none; padding: 0; }\
			.heading-text.h3 { padding-left: 10px !important; }\
			.heading-text.h4 { padding-left: 20px !important; }\
			.heading-text.h5 { padding-left: 30px !important; }\
			.heading-text.h6 { padding-left: 40px !important; }\
			#coordinates { display: none; }\
		';
		document.head.appendChild(style);
	}

	function createTableOfContents(headings) {
		var l = headings.length,
			tocHeading = document.createElement('h2'),
			toc = document.createElement('ul'),
			li, i, heading, newId;

		tocHeading.textContent = 'Table of Contents';
		toc.setAttribute('id', 'table-of-contents');

		// add headings to ToC, encode heading level into the link and the heading itself
		for (i = 0; i < l; i++) {
			heading = headings[i];
			newId = heading.getAttribute('id') + '--' + encodeURIComponent(heading.textContent) + '--' + heading.tagName.toLowerCase();
			li = document.createElement('li');
			li.innerHTML = '<a href="#' + newId + '" class="heading-text ' + heading.tagName.toLowerCase() + '">' + heading.textContent + '</a>';
			toc.appendChild(li);
			heading.setAttribute('id', newId);
		}

		document.body.prepend(toc);
		document.body.prepend(tocHeading);
	}
} ());
  3. Print the page to PDF using the browser and save the file as 'Berlin.pdf'
  4. Run the following script:
from operator import itemgetter
import urllib

from pdfrw import PdfReader, PdfWriter, PageMerge

def extractTocFromPdf(pdf):
    """ Extracts the table of contents from PDF annotations.
        Heading IDs have been altered with JS (before printing) to
        include information about the heading title and the level.
    """
    endings = ('h2', 'h3', 'h4', 'h5', 'h6',)
    dests = []
    for dest_name, dest in pdf.Root.Dests.items():
        dest_splits = dest_name.rsplit('--', 2)
        if len(dest_splits) == 3 and dest_splits[2] in endings:
            # (anchor, heading, level, 1-based page number)
            dests.append((
                '#' + dest_name[1:],
                urllib.unquote(dest_splits[1]),
                dest_splits[2],
                pdf.pages.index(dest[0]) + 1
            ))

    # sort by page number, then by heading level
    dests.sort(key=itemgetter(3, 2))
    return dests

def generateTocHtml(toc, file_name):
    """ Generates the table of contents and saves it as HTML.
        TODO: move CSS rules to print styles, possibly as a separate
              RL module and load that module instead of the modules below.
        TODO: use a template to generate the HTML
        TODO: make HTML and CSS RTL-aware
    """
    html = '<html><head><link rel="stylesheet" href="https://en.wikipedia.org/w/load.php?modules=mediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Csite.styles%7Cext.cite.style%7Cmediawiki.page.gallery.styles&amp;only=styles&amp;skin=vector"/>'
    html += """<style>body { width: 8.27in /* A4 */; margin: 0 auto; }
		#table-of-contents { page-break-after: always; list-style: none; padding: 0; }
		#table-of-contents a { display: block; }
		.heading-text.h3 { padding-left: 10px !important; }
		.heading-text.h4 { padding-left: 20px !important; }
		.heading-text.h5 { padding-left: 30px !important; }
		.heading-text.h6 { padding-left: 40px !important; }
		.page-number { float: right; }
		#coordinates { display: none; }</style>"""
    html += '</head><body>'
    html += '<h2>Table of Contents</h2>'
    html += '<ul id="table-of-contents">'

    for item in toc:
        html += '<li><a href="%s"><span class="heading-text %s">%s</span><span class="page-number">%s</span></a>' %\
                              (item[0], item[2], item[1], item[3])

    html += '</ul>'
    html += '</body></html>'
    with open(file_name, 'w') as fout:
        fout.write(html)


def replaceToc(pdf, new_toc, file_name):
    """ Replace the old table of contents (without page numbers) with
        the one with page numbers
    """
    outdata = PdfWriter()
    outdata.addpages(new_toc.pages)
    n_toc_pages = len(new_toc.pages)
    outdata.addpages(pdf.pages[n_toc_pages:])
    outdata.write(file_name)

if __name__ == '__main__':
    pdf = PdfReader('Berlin.pdf')
    toc = extractTocFromPdf(pdf)
    generateTocHtml(toc, 'Berlin_toc.html')
    # TODO: Open 'Berlin_toc.html' in the browser and print to PDF and save it as 'Berlin_toc.pdf'
    print("Open 'Berlin_toc.html' in the browser and print to PDF and save it as 'Berlin_toc.pdf'. Enter 1 when you're done.")
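    # str() makes the check work whether input() returns an int (Python 2) or a string (Python 3)
    if str(input()).strip() == '1':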
        new_toc = PdfReader('Berlin_toc.pdf')
        replaceToc(pdf, new_toc, 'Berlin_new_toc.pdf')
        infn = PdfReader('Berlin_new_toc.pdf')
        outfn = 'Berlin_final.pdf'
        watermark = PdfReader('Page-number.pdf')
        for i, page in enumerate(infn.pages):
            wmark = PageMerge().add(watermark.pages[i])[0]
            PageMerge(page).add(wmark, prepend=True).render()
        PdfWriter().write(outfn, trailer=infn)
        print('Ready. See Berlin_final.pdf')

Notes

The above approach is a proof of concept. It shows that what we're trying to do is achievable. Since we printed a RESTBase endpoint first and then manipulated the output PDF, we needed to encode some data about headings into PDF annotations. This step won't be necessary if we pass in this information using Extension:Collection.
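
The heading-data encoding mentioned in these notes follows the pattern `id--encoded title--level`. A minimal round-trip sketch in Python 3 is shown below; it uses `urllib.parse` rather than the Python 2 `urllib.unquote` call from the script above, and the function names are hypothetical. `quote(..., safe='')` is used as an approximation of the browser's `encodeURIComponent`, not an exact equivalent.

```python
from urllib.parse import quote, unquote

LEVELS = ('h2', 'h3', 'h4', 'h5', 'h6')


def encode_heading_id(section_id, title, level):
    """Pack the section id, heading text and heading level into one id string."""
    return '--'.join((section_id, quote(title, safe=''), level))


def decode_heading_id(encoded):
    """Return (section_id, title, level), or None if the id was not encoded."""
    parts = encoded.rsplit('--', 2)
    if len(parts) != 3 or parts[2] not in LEVELS:
        return None
    return parts[0], unquote(parts[1]), parts[2]


encoded = encode_heading_id('Nightlife_and_festivals',
                            'Nightlife and festivals', 'h3')
print(encoded)  # Nightlife_and_festivals--Nightlife%20and%20festivals--h3
print(decode_heading_id(encoded))
print(decode_heading_id('References'))  # None
```

One limitation of this scheme: a heading whose text itself contains `--` would be split in the wrong place, since percent-encoding leaves hyphens untouched. Passing the heading data separately avoids the issue altogether.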

@Tgr does the approach in T168871#3449625 sound sane to you? Is this something we can improve on and push to production? Any concerns?

I think this is a plausible way to do it and I don't see anything wrong with the logic. The new ToC mustn't take up a different number of pages than the replaced one, but that doesn't seem hard to ensure. (Re: generating the ToC and passing information, https://gerrit.wikimedia.org/r/#/c/364137 contains some code for that, which is IMO production-ready.)

I'll defer to Ops/Services/Security on whether it is better or worse than wkhtmltopdf. I would prefer PDF post-processing but neither approach is great. (Of course if we had a great option, we wouldn't be still debating this...) The PHP process has to keep running and keep the connection open while Electron is working (plus while the python script is working, although I'd imagine that to take much less time), but that doesn't seem tragic.

Product-wise this is better than wkhtmltopdf as it can do the same things (except maybe outlines, although those are doable too as long as the PDF library supports adding to the structured data in the file) and we aren't rendering with a two-year-old browser with outdated CSS support. Vivliostyle has some nice extras (a web view, support for all kinds of print-oriented styling features like footnotes or page floats) but is less stable; it might be complementary but probably shouldn't be our only option.

In general, the combination of HTML concatenation / pre-processing, basic browser-based rendering, and PDF post-processing makes sense to me. The important thing from my perspective is that we'll be able to use a well-maintained PDF renderer, which is the most complex part of the system. I also agree with @Tgr's assessment of Vivliostyle. It would be more elegant in principle, but might make more sense to tackle in a second step.

For an actual production deploy, we'll need to think about how we structure this to ensure decent security and maintainability. We should also talk about the longer term ownership of the pieces.

@GWicke - in terms of ownership, would it make sense for the web team to take responsibility for styles and post-processing, while general Electron maintenance would still be in the hands of the Services team?

I don't have a strong preference for either. I think the post-processing approach makes sense overall and without looking at it very closely, it seems to me like Electron (and headless Chrome) would be better bets compared to wkhtmltopdf with regards to maintainability, compatibility, security etc.

My only concern would be around how this would look internally in terms of services/microservices. For example, I remember reading before about parsing RESTBase HTML in MediaWiki with Remex etc., which sounded like it would create even more service loops in our architecture (something I'm never too fond of). These are implementation details though, which I'm sure we can figure out as we move forward, and not concerns around the overall high-level design.

ovasileva claimed this task.

seems like this one is ready to go and we can commit to this solution. I'll set up the follow-up tasks. Thanks @bmansurov!

Change 364137 abandoned by Gergő Tisza:
[WIP] Add TOC and page numbers via PDF post-processing

Reason:
There is no decent PHP library for PDF metadata manipulation (Zend_PDF, which was used here, is unmaintained). Will be shelled out to Python instead.

https://gerrit.wikimedia.org/r/364137