[Spike] Determine changes necessary for concatenation support
Closed, Resolved · Public

Authored By

	ovasileva
	Apr 18 2017, 10:14 PM

Description

Background - we would like to investigate replicating some or most of OCG's functionality using the Electron PDF service. Namely, we would like the ability to concatenate articles and allow transformations which direct the look and feel of Electron.

Acceptance Criteria
Determine the best way to set up the back end for rendering concatenated PDFs according to the following requirements:

PDF generation must be triggered from the book creator and from download as PDF links (we must be able to generate PDFs for single and multiple articles)
For multiple articles (books), the current UI of the book creator will be used
Users will be able to select between a two-column and single-column layout, where the two-column layout will render using OCG and the single-column layout will render using Electron. This must be available for both books and individual articles (similar to the current implementation on mediawiki (https://www.mediawiki.org/w/index.php?title=Special:ElectronPdf&page=MediaWiki&action=show-selection-screen&coll-download-url=%2Fw%2Findex.php%3Ftitle%3DSpecial%3ABook%26bookcmd%3Drender_article%26arttitle%3DMediaWiki%26returnto%3DMediaWiki%26oldid%3D2301969%26writer%3Drdf2latex))
Concatenated PDFs must include the following:
- Table of contents
  - Table of contents must contain the individual table of contents for each article as subsections
  - Table of contents must be clickable - selecting a link from the table of contents must navigate to the correct position within the article
- All tables and infoboxes available in the original articles
- Chapter structure - each article must be numbered as a chapter and marked accordingly in the table of contents
- References
  - References will appear individually at the end of each article. If links are available within the references, they will be available within the created PDF
- Blue links - all blue links will be available within the PDF. Blue links will be styled differently
- Styles - styles must contain the current desktop print styles (in progress here: T135022: [EPIC] Improve print styles in desktop and mobile sites)
- Contributions:
  - all text contributors - a section for contributions will appear at the end of each book. The list of contributors will be separated by the name of the article
  - all image contributors - a section for image contributions will appear at the end of each book. The list of contributors will be separated by the name of the article
- Content license

Example Structure:
Book title (page break)
Table of contents (page break)
Chapter 1, Article Title, article 1, article 1 references (page break)
Chapter 2, Article Title, article 2, article 2 references (page break)
Text and image sources, contributors, and licenses

Section 1: text sources
- Section 1.1: text sources article 1
- Section 1.2: text sources article 2
Section 2: image sources
- Section 2.1: image sources article 1
- Section 2.2: image sources article 2
Section 3: content license

Questions to answer

In T163272#3207989, @bmansurov wrote:

To summarize, we need to find a way to make transformations to HTML/wikitext before rendering a PDF. The requirements for the transformations are listed in the acceptance criterion "Concatenated PDFs must include the following:" and its sub-bullets.

Outcomes

Using wkhtmltopdf we'll have to make the following transformations:

Create a cover page in HTML. Make sure that the book title is vertically and horizontally aligned in the middle.
Retrieve articles from RESTBase, e.g. Book, and lay them out in the hierarchy requested in the metabook. For each article:
- Create a title with the chapter number, e.g. "1. Apple"
- Prefix section titles with the chapter number and section number, e.g. "1.1. Botanical Information"
- Since articles can be grouped into chapters on Special:Book, we need to make each article a subsection of a chapter if the article is a part of a group. For example, if I'm interested in creating a book about fruits and vegetables, I may have two chapters called "1. Fruits" and "2. Vegetables". The article "Apple", would go under "1. Fruits" and be titled as "1.1. Apple". Sections of the article would be prefixed with "1.1.1.", "1.1.2", etc.
- Change references links to point to the references on the page as opposed to the references in the source URL.
- Remove red links.
- We may also have to '"push down" headings when the page has a =-level section' but this case is rare.
Retrieve Contributors, images (https://en.wikipedia.org/w/api.php?action=query&titles=File%3ABook_Collage.png&prop=imageinfo&iiprop=url|size|mediatype|mime|sha1|extmetadata), and licence info from the MW API endpoint and create an HTML page using them.
Generate a PDF using wkhtmltopdf. The table of contents and outline will be generated automatically if the correct arguments are passed. An example command is as follows:

./wkhtmltopdf cover http://mw.loc/w/cover.html toc page https://en.wikipedia.org/api/rest_v1/page/html/Apple/781322367/8461718d-3d68-11e7-86c3-bba2fc26f3f6 https://en.wikipedia.org/api/rest_v1/page/html/Pear https://en.wikipedia.org/api/rest_v1/page/html/Cherry https://en.wikipedia.org/api/rest_v1/page/html/Grape https://en.wikipedia.org/api/rest_v1/page/html/Persimmon --print-media-type fruits.pdf

We'd have to point RESTBase urls from the above to local HTML files that we created using the transformations.
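The command above could be assembled programmatically once the transformed HTML files exist. The following is a minimal sketch, not the implementation chosen in this task: the helper name `build_wkhtmltopdf_args` and the local file names in the usage note are hypothetical, and only the `cover`, `toc`, `page` keywords and the `--print-media-type` flag are taken from the example command.

```python
from typing import List


def build_wkhtmltopdf_args(
    cover: str,
    pages: List[str],
    output: str,
    binary: str = "./wkhtmltopdf",
) -> List[str]:
    """Return an argv list in the same order as the example command.

    `cover` and `pages` may be URLs or paths to locally transformed HTML
    files (hypothetical helper, not part of wkhtmltopdf itself).
    """
    if not pages:
        raise ValueError("at least one page is required")
    # Order mirrors the example: cover object, toc object, page inputs,
    # then the print-media-type flag and the output file.
    return [binary, "cover", cover, "toc", "page", *pages,
            "--print-media-type", output]
```

Passing the resulting list to `subprocess.run(args, check=True)` would run the renderer; building the list separately keeps the argument order easy to test without invoking the binary.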

Related Objects

Status	Assigned	Task
Resolved	• JKatzWMF	T150871 [EPIC] (Proposal) Replicate core OCG features and sunset OCG service
Resolved	TheDJ	T150872 Replace OCG in collection extension with Electron
Resolved	pmiazga	T163272 [Spike] Determine changes necessary for concatenation support

Event Timeline


Restricted Application added a subscriber: Aklapper. · Apr 18 2017, 10:14 PM

@bmansurov, @Tgr - it looks like we'll probably need separate styles for books and individual articles. @Nirzar - any thought on this?

• Yunoselect5 closed this task as Resolved. · Apr 18 2017, 10:24 PM

• Yunoselect5 claimed this task.

• Yunoselect5 removed a subscriber: Aklapper.

• Yunoselect5 added a subscriber: Aklapper. · Apr 18 2017, 10:27 PM

ovasileva reopened this task as Open. · Apr 18 2017, 10:27 PM

• Yunoselect5 closed this task as Invalid. · Apr 18 2017, 10:32 PM

• Yunoselect5 reopened this task as Open.

• Yunoselect5 lowered the priority of this task from High to Low.

• Yunoselect5 raised the priority of this task from Low to High.

• Yunoselect5 reassigned this task from • Yunoselect5 to ovasileva. · Apr 18 2017, 10:41 PM

• Yunoselect5 subscribed.

• Yunoselect5 unsubscribed.

• Yunoselect5 awarded a token. · Apr 18 2017, 10:43 PM

• Yunoselect5 rescinded a token.

ovasileva added a parent task: T150872: Replace OCG in collection extension with Electron. · Apr 18 2017, 11:25 PM

Jdlrobson moved this task from Incoming to Product Owner Backlog on the Web-Team-Backlog board. · Apr 19 2017, 12:28 AM

FWIW the OCG HTML transformation logic is in the Visitor class and it does not seem to do anything interesting, beyond filtering out various things.

One thing to consider for the future is that with TemplateStyles editors will be able to add print-specific styles for their templates. (It's already possible via MediaWiki:*.css but impractical.) It would probably be useful to have some kind of tutorial/guidelines/best practices document for that.

The main technical question IMO is what tool to use for HTML transformations: some npm library (that would mean putting the logic in the Electron service which seems less nice than making Collection self-contained), or simple text processing (that will end badly unless we are sure only very simple transformations will be needed), or DOM manipulation in PHP (that will create a dependency on RemexHtml as I don't think there is anything else out there able to deal with HTML5).

The necessary HTML transformations I can think of:

change ids / URL fragments to be unique
remove tables of content
change h1 to h2 (or push everything down a level?)

ovasileva moved this task from Product Owner Backlog to Upcoming on the Web-Team-Backlog board. · Apr 21 2017, 4:01 PM

Jdlrobson updated the task description. · Apr 24 2017, 7:19 PM

baha to add etherpad notes and summarise the questions we need to answer.

To summarize, we need to find a way to make transformations to HTML/wikitext before rendering a PDF. The requirements for the transformations are listed in the acceptance criterion "Concatenated PDFs must include the following:" and its sub-bullets.

During our meeting, @Tgr mentioned some issues we may have and captured his thoughts in T163272#3195371. One thing that's not mentioned there is how to deal with references if we have to move all article references to the end of the PDF. Since that's not what the requirement asks, we don't have to worry about it just yet.

Here is the link to the etherpad notes that were taken during our meeting.

• NHarateh_WMF added a project: Product-Infrastructure-Team-Backlog-Deprecated. · Apr 25 2017, 12:42 PM

• NHarateh_WMF moved this task from Needs triage to Backlog on the Product-Infrastructure-Team-Backlog-Deprecated board. · Apr 25 2017, 12:44 PM

• Jhernandez updated the task description. · Apr 25 2017, 4:37 PM

What is the outcome from the research? (find a way to...)

An implementation proposal? Possible options? A meeting for discussing them?

We're discussing this and we're not sure how to timebox it without knowing which sort of outcomes we want cc/ @ovasileva @bmansurov @Jdlrobson

pmiazga subscribed. · Apr 25 2017, 4:44 PM

Process notes: Given that the team has identified in two separate meetings that the scope of this task is too large, this task may benefit from being broken up. Value of that may include:

Work can be done in parallel by multiple team members (this one is not likely, as this task has dependencies and blockers if broken up, anyway)
It may make the task easier to see, in terms of status. That is, it is hard to track the progress on this task, and make adjustments to expectations over time, if it's all happening in one giant task history. There is value in keeping a record, but it's unwieldy at this size.
The task would be more approachable if it were smaller. The Web team has frequently identified a challenge of picking up tasks with large (5+ points) scope, and often those tasks get skipped over even though they are the highest priority. This task has a better chance of someone doing it if it is in discrete chunks (even though one person may end up doing all those chunks themselves). This task is also not pointable, as it is a spike, but one recommendation for time commitment was "one week", which is a ton in a world where the team's spikes tend to max out at "8 hours".

I just saw this... I think there's a big missing piece of interface for this... or.. is this task only about figuring out the backend/technical way of joining pdfs?

@ovasileva the engineering assessment request feels a bit premature given the progress we have on product and design so far. As this card is on the next sprint, there are a lot of open questions around what product feature it will serve. In absence of @ovasileva > Moving it to "Needs Analysis" till we decide on that.

@Nirzar @ovasileva let's set up a meeting?

Tgr moved this task from Backlog to Tracking on the Product-Infrastructure-Team-Backlog-Deprecated board. · May 2 2017, 5:22 PM

Tgr removed a project: Reading-Infrastructure-Team-Old (Don't use).

@Jdlrobson, @Nirzar - yes please. Current state of mind is, however, that we'll be replicating the back-end functionality with the current UI. Changing the UI will be a separate and future process. The only new design work here would be the new print styles. I'll set up a meeting to discuss, but I would highly recommend making space in the upcoming sprint to account for this.

Apologies for all the questions... As an outsider to the OCG work, I want to help get a shared understanding of the plan and the problems here. This task does seem to be about technical choices for concatenating multiple articles https://phabricator.wikimedia.org/T163272#3210650 but as written that's not clear, so we should look to improve the description. These questions may help.. I hope!

In T163272#3207989, @bmansurov wrote:

To summarize, we need to find a way to make transformations to HTML/wikitext before rendering a PDF. The requirements for the transformations are listed in the acceptance criterion "Concatenated PDFs must include the following:" and its sub-bullets.

It seems like the question to answer here is whether we should do this in PHP, e.g. via HTMLFormatter (that's in core), or in a node service. If so, shouldn't the main concerns here be around cacheability/performance? I don't see any mention in the spike of considering these two or of what our expectations are. It feels like we could achieve the transformations (when we decide what they are) with either option.

Would it make sense to define the required transformations first e.g. propose a list of what needs to happen and then assess the two options against those? I see @Tgr made a suggestion here: https://phabricator.wikimedia.org/T163272#3210650
Why not start by fleshing those out and then ask the simple question do we do these in PHP or Node?

we would like to investigate replicating some or most of OCG's functionality using the Electron PDF service. Namely, we would like the ability to concatenate articles and allow transformations which direct the look and feel of Electron.

It sounds like the background here is it should be possible to combine pages into a new article that can be printed.
Is there any upper limit on the number of pages that can be combined? For instance could I combine every page in the wiki? How do lists of pages get generated? Via user input or via an api output from say categories/Linkshere?

PDF generation must be triggered from the book creator and from download as PDF links (we must be able to generate PDFs for single and multiple articles)

Can this be summarised as there is a URL I can POST or GET that returns a PDF? Is this relevant here with respect to the decision over whether to use Node or PHP service?

Users will be able to select between a two-column and single-column layout, where the two-column layout will render using OCG and the single column layout will render using electron. This must be available for both books and individual articles (similar to current implementation on mediawiki

It would help to define what goes into a 2 column layout. Is this relevant here? This could be done via CSS using https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Columns/Using_multi-column_layouts regardless of technology choice.

Table of contents must be clickable - selecting a link from the table of contents must navigate to the correct position within the article

Seems like all this can be condensed into the statement - it should be possible to create a table of contents based on the new structure of the concatenated article.

All tables and infoboxes available in the original articles

Again I'm not sure how this is relevant to the technology choice.

Chapter structure - each article must be numbered as a chapter and marked accordingly in the table of contents

Sounds like this means turning h1s into h2s etc.. where the page title is the h1.

This is problematic as some page content may contain h1s so some normalisation would need to occur
Consider concatenating the two pages

Lead section of page 1
= Page 1 heading =
Text

and

Lead section of page 2
== Page 2 heading ==
Text

How should these be combined?
Also what happens to h6s if this normalisation does occur (h7s do not exist.. do these stay h6s or do they drop out the table of contents?)

References will appear individually at the end of each article. If links are available within the references, they will be available within the created PDF

Does this mean a concatenated article will have a single references section? What about articles where references are in multiple sections e.g. Notes / References will these be merged?

Blue links - all blue links will be available within the PDF. Blue links will be styled differently

I'm not sure I understand what this means. What are blue links in this context? Can you expand?

Contributions:

all text contributors
all image contributors

I'm not quite sure I understand what this means. Does this mean we want to list every single possible contributor for an article by name? If so, that sounds like a big performance hog regardless of where it's done...

Content license

What if there are multiple licenses e.g. Wikidata?

As determined by a previous meeting, we are pulling this into the sprint board needs analysis column until questions on the task are answered.

ovasileva moved this task from Needs Prioritization to 2016-17 Q4 on the Web-Team-Backlog board. · May 9 2017, 12:29 PM

Tgr moved this task from Tracking to Kanban on the Product-Infrastructure-Team-Backlog-Deprecated board. · May 10 2017, 9:25 PM

Tgr edited projects, added Product-Infrastructure-Team-Backlog-Deprecated (Kanban); removed Product-Infrastructure-Team-Backlog-Deprecated.

Tgr moved this task from To Do to Doing on the Product-Infrastructure-Team-Backlog-Deprecated (Kanban) board.

Jdlrobson mentioned this in T160128: MFA: Spike [8hr] Review CSS code conventions and improve to minimise chance of UI regressions. · May 12 2017, 2:54 PM

• bmansurov claimed this task. · May 15 2017, 5:14 PM

• bmansurov moved this task from To Do to Doing on the Readers-Web-Kanbanana-Board-Old board.

In T163272#3245020, @Jdlrobson wrote:

It seems like the question to answer here is whether we should do this in PHP, e.g. via HTMLFormatter (that's in core), or in a node service. If so, shouldn't the main concerns here be around cacheability/performance? I don't see any mention in the spike of considering these two or of what our expectations are. It feels like we could achieve the transformations (when we decide what they are) with either option.

Shouldn't we look at OCG request numbers first before worrying about caching? Maybe caching is not needed at this stage of development. Your point is still valid though. I was hoping our choice of PHP or JS would help us make this decision too. For example, I was going to push for using RESTBase for caching benefits.

Would it make sense to define the required transformations first e.g. propose a list of what needs to happen and then assess the two options against those? I see @Tgr made a suggestion here: https://phabricator.wikimedia.org/T163272#3210650
Why not start by fleshing those out and then ask the simple question do we do these in PHP or Node?

The required transformations are already listed in the description of the task. See the bullet point "Concatenated PDFs must include the following:"

Is there any upper limit on the number of pages that can be combined? For instance could I combine every page in the wiki? How do lists of pages get generated? Via user input or via an api output from say categories/Linkshere?

The goal of the spike is not to worry about the front-end of OCG, we'd like to replace the back-end only for now. Whatever the current settings are, they will remain unchanged. The list is generated by enabling the book creator feature and adding pages to the list. The feature can be found by going to the main page and looking for "Create a book" on the left navigation pane.

PDF generation must be triggered from the book creator and from download as PDF links (we must be able to generate PDFs for single and multiple articles)

Can this be summarised as there is a URL I can POST or GET that returns a PDF? Is this relevant here with respect to the decision over whether to use Node or PHP service?

Yes, I think we can summarize as you did. I also think that line was written from the product perspective.

Users will be able to select between a two-column and single-column layout, where the two-column layout will render using OCG and the single column layout will render using electron. This must be available for both books and individual articles (similar to current implementation on mediawiki

It would help to define what goes into a 2 column layout. Is this relevant here? This could be done via CSS using https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Columns/Using_multi-column_layouts regardless of technology choice.

If I'm not mistaken what's being requested is that we let OCG render the two-column layout, and the new electron service should only worry about the single column layout of the same article. With the current UI the user has a choice to choose either.

Table of contents must be clickable - selecting a link from the table of contents must navigate to the correct position within the article

Seems like all this can be condensed into the statement - it should be possible to create a table of contents based on the new structure of the concatenated article.

The new requirement says "Table of contents must contain the individual table of contents for each article as subsections", so it's not necessary to combine all articles and then generate the table of contents.
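A per-article table of contents can be derived from the article HTML itself. The sketch below is illustrative only: it uses Python's standard `html.parser`, which is not a spec-compliant HTML5 parser, and the class and function names (`HeadingCollector`, `article_toc`) are hypothetical. It collects the level, `id`, and text of each heading so that TOC entries can link to the right anchor.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every heading in an HTML fragment."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.entries: List[Tuple[int, str, str]] = []
        self._level: Optional[int] = None  # level of the heading being read
        self._id = ""
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS and self._level is None:
            self._level = int(tag[1])
            self._id = dict(attrs).get("id") or ""
            self._text = []

    def handle_data(self, data):
        # Text inside nested inline tags (e.g. <i>) is gathered as well.
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            text = "".join(self._text).strip()
            self.entries.append((self._level, self._id, text))
            self._level = None


def article_toc(html: str) -> List[Tuple[int, str, str]]:
    """Return the headings of one article in document order."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.entries
```

Each article's entries could then be nested under its chapter entry in the book-level table of contents.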

All tables and infoboxes available in the original articles

Again I'm not sure how this is relevant to the technology choice.

I think the current implementation of OCG has a problem with rendering tables. We're making sure that our technology choice doesn't have this limitation.

Chapter structure - each article must be numbered as a chapter and marked accordingly in the table of contents

Sounds like this means turning h1s into h2s etc.. where the page title is the h1.

Yes, or something to that effect.

This is problematic as some page content may contain h1s so some normalisation would need to occur
Consider concatenating the two pages
Lead section of page 1
= Page 1 heading =
Text
and
Lead section of page 2
== Page 2 heading ==
Text
How should these be combined?
Also what happens to h6s if this normalisation does occur (h7s do not exist.. do these stay h6s or do they drop out the table of contents?)

I think we'll have to do some kind of normalization here as there should be only one H1 on every article (and that's the article title). If there are H1's in the body, we'll have to change to H2's, etc.
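One possible normalization rule, sketched under assumptions: shift every heading in an article so that its shallowest level becomes a chosen top level, and clamp anything that would go past h6. The function name and the clamping policy are hypothetical; the thread leaves open what should happen to h6 headings.

```python
from typing import List


def normalize_heading_levels(
    levels: List[int], top: int = 2, deepest: int = 6
) -> List[int]:
    """Shift an article's heading levels so the shallowest becomes `top`.

    Levels that would exceed `deepest` are clamped, since HTML has no h7
    (clamping is an assumed policy, not a decision from this task).
    """
    if not levels:
        return []
    # Only push headings down; never pull an article's headings up.
    shift = max(0, top - min(levels))
    return [min(level + shift, deepest) for level in levels]
```

With this rule, the first example page (a `=`-level section) is pushed down one level, while the second page (`==`-level sections only) is left unchanged. Passing `top=3` would cover books whose articles are grouped into chapters.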

References will appear individually at the end of each article. If links are available within the references, they will be available within the created PDF

Does this mean a concatenated article will have a single references section? What about articles where references are in multiple sections e.g. Notes / References will these be merged?

No, it means every article will have its own references section, just like its own table of contents.

Blue links - all blue links will be available within the PDF. Blue links will be styled differently

I'm not sure I understand what this means. What are blue links in this context? Can you expand?

Links that point to a valid URL are blue links, I reckon. I think the requirement is saying that we should keep working links as is in the generated PDF, we should not remove links from the output.

Contributions:

all text contributors
all image contributors

I'm not quite sure I understand what this means. Does this mean we want to list every single possible contributor for an article by name? If so, that sounds like a big performance hog regardless of where it's done...

Take a look at the current implementation output starting page 31:

Book (1) ocg.pdf (3 MB)

Content license

What if there are multiple licenses e.g. Wikidata?

Whatever is returned by the API will be output. For example, https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=rightsinfo

@Tgr here's what I came up with. Please take a look and let me know if you have any questions.

extension-OfflineContentGenerator-bundler seems to have the majority of the functionality implemented. So extracting what the extension has may lead to a speedier outcome.

As for concatenation, RESTBase returns HTML of pages, e.g. Book, that we can use to generate books. Here are few points to keep in mind:

The endpoint doesn't return a table of contents, though. This can be remedied by iterating over the HTML and generating the table of contents on the client side.
We also need to make individual requests to retrieve articles' contents one by one, i.e. no batch retrieval of HTML is possible, it seems.
Since RESTBase-generated HTML element IDs and class names collide, we'll have to namespace those elements, e.g. using the page ID.
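To make the namespacing idea concrete, here is a rough sketch that prefixes `id` attributes and same-document `href="#..."` fragments with a page ID. It is regex-based and handles only double-quoted attributes, so it is an illustration rather than a robust transformation; a real implementation would more likely work on a parsed DOM. The function name and the `p<page_id>-` prefix format are hypothetical.

```python
import re

# Matches id="..." but not data-id="..." or other attributes ending in "id".
_ID_ATTR = re.compile(r'(?<![\w-])id="([^"]*)"')
# Matches same-document links such as href="#History".
_FRAGMENT_HREF = re.compile(r'(?<![\w-])href="#([^"]*)"')


def namespace_ids(html: str, page_id: int) -> str:
    """Prefix element ids and local fragment links with the page ID."""
    prefix = f"p{page_id}-"
    html = _ID_ATTR.sub(lambda m: f'id="{prefix}{m.group(1)}"', html)
    return _FRAGMENT_HREF.sub(
        lambda m: f'href="#{prefix}{m.group(1)}"', html
    )
```

This sketch does not touch class names or CSS selectors that target the renamed IDs.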

Contributors, images (https://en.wikipedia.org/w/api.php?action=query&titles=File%3ABook_Collage.png&prop=imageinfo&iiprop=url|size|mediatype|mime|sha1|extmetadata), and licence info can be retrieved from the MW API endpoint. For caching purposes we may have to create RESTBase endpoints for the above URLs.
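The MW API requests mentioned above can be expressed as small URL builders. This is a sketch only: the helper names are hypothetical, and `prop=contributors` with `pclimit=max` is one assumed way to list text contributors. The `imageinfo` and `rightsinfo` parameters are the ones that appear in this task.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def api_url(**params: str) -> str:
    """Build an Action API URL, always requesting JSON output."""
    query = {"format": "json", **params}
    return f"{API}?{urlencode(query)}"


def image_info_url(file_title: str) -> str:
    """URL for image metadata, using the iiprop list quoted in this task."""
    return api_url(
        action="query",
        titles=file_title,
        prop="imageinfo",
        iiprop="url|size|mediatype|mime|sha1|extmetadata",
    )


def contributors_url(title: str) -> str:
    """URL for an article's contributors (assumed choice of module)."""
    return api_url(action="query", titles=title, prop="contributors",
                   pclimit="max")


def license_url() -> str:
    """URL for the wiki's content license information."""
    return api_url(action="query", meta="siteinfo", siprop="rightsinfo")
```

Stable, parameter-ordered URLs like these also make it easier to put a cache in front of the requests.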

ovasileva updated the task description. · May 16 2017, 12:47 PM

Re: performance, do we expect concatenated HTML to be exposed directly to users in some use cases? Do we expect HTML concatenation to be slower than or comparable to HTML -> PDF transform? If we expect neither then choosing the concatenation tool based on performance is probably not a useful optimization.

Memory use might be a more important concern - say I want to put 100 articles in a book, so the total HTML is something like 100MB. Will that work?

Caching is more of an issue of URL structure / uniqueness than language choice.

In T163272#3265901, @bmansurov wrote:

extension-OfflineContentGenerator-bundler seems to have the majority of the functionality implemented. So extracting what the extension has may lead to a speedier outcome.

The bundler just fetches stuff and puts them in a directory (and sends progress notifications so that the UI can display progress) and creates an sqlite db with attribution etc. metadata, right?
Some of that code might be reused if the logic will live in RESTBase but it does not seem like too much effort to rewrite. (I didn't read through the code though, just glanced at it, so I might be missing part of the logic.)

We also need to make individual requests to retrieve articles' contents one-by-one, i.e. no batch retrieval of HTML is possible, it seems.

Even if it were possible (it probably is for ParserCache for example), chances are we win more by fetching Varnish-cacheable URLs than by doing things in a single request.

Since RESTBase generated HTML element ID’s and class names collide, we’ll have to namespace those elements, e.g. using page ID.

That would mean we have to process CSS as well, unless we can limit ID/class renaming to things generated by RESTBase and not used for styling. (Also TOC links since those can collide as well.)
Another option is to generate a PDF per page and do PDF concatenation. Seems painful though.

@Tgr, thanks. Also, what are some of the reasons for doing this in PHP?

Does not require another service to be created, can be used on third-party wikis with no support for node services.

I will somewhat immodestly plug my own library (RemexHtml) for this.

There are basically two ways to use RemexHtml:

You can use RemexHtml to parse HTML into a DOMDocument, then modify the DOMDocument, then use RemexHtml to serialize the DOMDocument back to HTML.
You can use RemexHtml's streaming feature to modify and reserialize the document on the fly, without building a full DOM.

The first is easier and the API is probably more stable, whereas the second has lower memory usage if the document is large. We use the streaming mode for Wikipedia articles on the basis that a single article is large -- if you're concatenating a bunch of articles then this would presumably fit the definition of "large" even better.

I'm not aware of any other fully compliant HTML5 parser in any language that has this streaming mode. The HTML 5 spec is written in a way which assumes the existence of a DOM, and it took a lot of work to map those ideas to an event stream with aggressive memory release of unnecessary state.

Node.js not only lacks an HTML5 parser with this streaming feature, it also lacks a native DOM library. Parsoid uses Domino, which has non-ideal performance characteristics due to its use of plain JS arrays, like insertion or removal of nodes in the middle of a node list taking O(N) time in the number of sibling nodes.

RemexHtml does require you to store the HTML for the entire document in memory (input, and one output buffer per stack level). If that's a problem for your application then maybe we can adapt it.

RemexHtml probably only has one user (MW), so at this point you can hack it to do whatever you want, and as long as the MW tests still pass, it's all good.

@tstarling, thanks. Has there been any comparative analysis of the performance of RemexHtml's streaming mode and other open source libraries? I think it's a good idea to use this library given that it's already being used in MW and it has a unique streaming feature.

I also think that regardless of the library we choose, we have to generate PDFs of articles separately (as @Tgr suggests in T163272#3270069 -- although I'm not sure of its pain points). The reason is that we don't know whether a user will want to generate a book containing ten articles or a thousand. At some point even the most performant parser will hit its limits. Looking at the example structure in the description, I see that only the table of contents, contributors, and licenses need to be combined (each article can be parsed separately). I suppose these will be a small portion of the book.

Another issue with parsing all articles in one go is that the electron PDF service may end up being a bottleneck as it has to re-parse the generated document in order to convert it to PDF.

• bmansurov updated the task description. · May 19 2017, 8:17 PM

Upon further investigation, I found that the electron-render-service doesn't allow generating table of contents with page numbers. The generated PDF won't even have PDF outline (which is useful for navigation). Generating table of contents in HTML is possible, but we cannot add page numbers as the PDF has to be laid out before we know which section is on what page. For this reason, I've looked around and found another open-source library (wkhtmltopdf licensed under LGPLv3) that generates well formatted PDFs with outline and table of contents.

What is also great is that the library can generate a PDF from multiple URLs. We don't have to worry about namespacing article IDs or generating individual PDFs and combining them ourselves. A brief comparison of memory usage and render time with the electron-render-service yielded comparable, if not slightly better, numbers.

Without any transformations (such as adding license or contributor info, or breaking pages into chapters), here is an outcome PDF:

fruits.pdf (2 MB)

using the command

./wkhtmltopdf cover http://mw.loc/w/cover.html toc page https://en.wikipedia.org/api/rest_v1/page/html/Apple/781322367/8461718d-3d68-11e7-86c3-bba2fc26f3f6 https://en.wikipedia.org/api/rest_v1/page/html/Pear https://en.wikipedia.org/api/rest_v1/page/html/Cherry https://en.wikipedia.org/api/rest_v1/page/html/Grape https://en.wikipedia.org/api/rest_v1/page/html/Persimmon --print-media-type fruits.pdf

Assuming that we'll be using the above library for generating PDFs, here are the transformations we need to do.

Generate a cover page in HTML and feed it to the rendering library;
For each article:
- Create a title with the chapter number, e.g. "1. Apple"
- Prefix section titles with the chapter number and section number, e.g. "1.1. Botanical Information"
Since articles can be grouped into chapters on Special:Book, we need to make each article a subsection of a chapter if the article is a part of a group. For example, if I'm interested in creating a book about fruits and vegetables, I may have two chapters called "1. Fruits" and "2. Vegetables". The article "Apple", would go under "1. Fruits" and be titled as "1.1. Apple". Sections of the article would be prefixed with "1.1.1.", "1.1.2", etc.
Create an HTML file that has information about "Text and image sources, contributors, and licenses" and feed it to the library too.
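The numbering scheme in this list can be sketched as follows. The data shape (a list whose items are either an article title or a `(chapter title, articles)` pair) and the function names are hypothetical stand-ins for whatever the metabook actually provides.

```python
from typing import List, Tuple, Union

Chapter = Tuple[str, List[str]]   # (chapter title, article titles)
Item = Union[str, Chapter]        # an ungrouped article or a chapter


def number_titles(items: List[Item]) -> List[str]:
    """Number top-level items as "N." and grouped articles as "N.M."."""
    numbered = []
    for i, item in enumerate(items, start=1):
        if isinstance(item, str):
            # Ungrouped article: it is a chapter of its own.
            numbered.append(f"{i}. {item}")
            continue
        chapter_title, articles = item
        numbered.append(f"{i}. {chapter_title}")
        for j, article in enumerate(articles, start=1):
            numbered.append(f"{i}.{j}. {article}")
    return numbered


def number_sections(prefix: str, sections: List[str]) -> List[str]:
    """Prefix section titles with the number of the enclosing article."""
    return [f"{prefix}{k}. {title}" for k, title in enumerate(sections, 1)]
```

For an ungrouped article "1. Apple", `number_sections("1.", ...)` gives "1.1. Botanical Information"; for a grouped article "1.1. Apple", `number_sections("1.1.", ...)` gives "1.1.1. ..." and so on.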

@bmansurov - it looks pretty good. If we use this library, do you know how much we can customize the styles themselves/what restrictions we will have?

Print styles can be fed to the library. I don't see any restrictions.

Concatenation is not particularly hard after or before the PDF conversion, either. PDF outlines could also be added by a separate tool (although it is a bit of a pain). Adding page numbers to the TOC is not possible without dedicated functionality in the converter tool though (target-counter() from CSS3 Generated Content for Paged Media could do it but it is not supported by any browser at this time).

OTOH wkhtmltopdf/QtWebKit is written in C++, so probably harder to debug issues / less secure. I also wonder how reliable it is? It seems to be a community-maintained fork of webkit proper.

Chapter/section numbering can be done with CSS counters.

Reference numbers link to live Wikipedia instead of scrolling the document, which is rather crappy (although OCG does that too). Would electron handle that correctly? "Save to PDF" in Chrome does.

Other transformations that are probably needed:

remove red links
"push down" headings when the page has a =-level section (or should we ignore this? it's super rare.)
- "push down" headings when the book has chapters
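As an illustration of the red-link removal, the sketch below unwraps anchors that carry the `new` class, keeping their text. That red links are marked with `class="new"` is how MediaWiki's default parser output behaves; whether the RESTBase HTML marks them the same way is an assumption here. The regex handles only double-quoted attributes, and the function name is hypothetical.

```python
import re

# An <a> tag whose class attribute contains the token "new",
# captured together with its inner content.
_RED_LINK = re.compile(
    r'<a\b[^>]*\bclass="(?:[^"]*\s)?new(?:\s[^"]*)?"[^>]*>(.*?)</a>',
    re.DOTALL,
)


def strip_red_links(html: str) -> str:
    """Replace red-link anchors with their inner content."""
    return _RED_LINK.sub(r"\1", html)
```

Links without the `new` class are left untouched, so blue links stay clickable in the generated PDF.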

• bmansurov mentioned this in T166188: Architecture of new rendering backend for Extension:Collection. · May 23 2017, 10:56 PM

Yes, target-counter seemed abandoned. It also was not clear to me whether section numbers added with CSS would be picked up by the PDF renderer while creating the table of contents with section numbers.

Good catch on reference links. I'll add the transformation of reference links and the others you mentioned to the combined list in the description.

I'll mention what we talked about during yesterday's meeting for others. As for debugging wkhtmltopdf, we decided to go with it on the basis that the electron-render-service also uses third-party libraries that are written in C, etc. and has the same issues mentioned above. We also decided to go with wkhtmltopdf because we didn't want to spend too much time concatenating various intermediate PDFs and generating the table of contents and outline using yet another tool.

• bmansurov updated the task description. · May 24 2017, 3:18 PM

^ Per T163272#3289229.

• bmansurov removed • bmansurov as the assignee of this task. · May 25 2017, 12:11 PM

pmiazga claimed this task. · May 31 2017, 5:01 PM

Looks pretty nice. @bmansurov great job! I think now we have everything and we can start working on the production-like code. I'm closing this task.

pmiazga closed this task as Resolved. · May 31 2017, 9:45 PM

• bmansurov updated the task description. · Jun 1 2017, 3:06 PM

ovasileva mentioned this in T171838: Build out article concatenation according to requirements for books. · Jul 27 2017, 11:24 AM

• bmansurov mentioned this in T171964: [Spike - 8 hrs] Where should article concatenation be implemented?. · Aug 9 2017, 11:09 PM

	F8161474: fruits.pdf
	May 22 2017, 6:52 PM

	F8093063: Book (1) ocg.pdf
	May 15 2017, 10:26 PM
