
Architecture of new rendering backend for Extension:Collection
Closed, Declined · Public

Description

We’d like to create an extension that generates a PDF from a list of articles. This task contains the plan for creating a new HTML->PDF backend for Extension:Collection. We’ll use wkhtmltopdf to generate PDFs. Debian has a package for it. Below is an initial rough draft.

The main use case for the generated PDF is that it will be used as a printed book. For that, the tool needs to be able to generate a PDF that has a table of contents with page numbers. The electron-render-service used in Extension:ElectronPdfService does not have this capability. Another use case is that the generated PDF will be usable on a computer, where items in the table of contents and links are clickable. The PDF should also have an outline for easy navigation. The electron-render-service doesn't have this capability either. What differentiates the new extension from the existing Offline Content Generator service is that the extension will be able to output tables.

Alternatively, we could in theory render a PDF using Electron, and then add page numbers and the table of contents with page numbers using another tool. If we go that route we'll still have to depend on another toolkit to do the job. I've looked at Pdftk and it seemed abandoned: the latest version appeared about four years ago. Another library I checked out was QPDF, whose latest release (version 6.0.0) was at the end of 2015, although there's been some activity on GitHub since then. By comparison, the latest stable release (version 0.12.4) of wkhtmltopdf was at the end of 2016. There may be other tools that we can use, and I'm open to exploring them. However, of the three tools above, wkhtmltopdf is both the most recently maintained and the easiest to deal with. It's easier because with the other tools we'd have to use Electron first and then apply further transformations to the PDF, and I'm not even sure those tools support the requirements we have.

So, the extension will be used as one of the backends to Extension:Collection. It will expose a couple of endpoints.

One of the endpoints will receive a payload in the metabook format and start rendering a PDF. The extension will retrieve HTML versions of articles from RESTBase. It will also retrieve metadata, such as the authors of images, from the MediaWiki API. It then transforms the HTML pages (as identified in T163272), or creates additional HTML pages such as the cover page, using RemexHtml (as suggested in T163272#3272877) and saves them in the file system. It will then call wkhtmltopdf with the HTML file names as parameters (as shown in T163272#3284896) to generate a PDF. The PDF will be saved in the file system with a unique name.
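
As a rough sketch of that invocation (the file names and the helper name are illustrative, not the extension's actual API; `cover` and `toc` are wkhtmltopdf's built-in object types, and `--outline` requests a PDF outline):

```python
# Sketch of assembling the wkhtmltopdf command line described above.
def build_wkhtmltopdf_args(cover_html, article_htmls, output_pdf):
    """Return the argv list for one render. The caller would pass this
    to subprocess.run() after writing the HTML files to the file system."""
    args = ["wkhtmltopdf", "--outline"]  # global options come first
    args += ["cover", cover_html]        # the generated cover page
    args += ["toc"]                      # auto-generated TOC with page numbers
    args += article_htmls                # one page object per article HTML file
    args += [output_pdf]                 # unique output name
    return args
```

The unique output name and the temporary HTML files would be produced by the earlier steps; this only shows how the pieces line up on the command line.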

While we could concatenate the HTML files into one and generate a PDF from that, we don't have to: wkhtmltopdf allows us to pass multiple pages and generates a single PDF. This is especially nice because we won't have to worry about the ID collisions that would happen with concatenated HTML.
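
To illustrate the collision problem that concatenation would create: two articles can both contain `id="References"`, so a merged document would have to rewrite every id and the anchors pointing at it. A naive, regex-based sketch of that rewriting (a real implementation would use an HTML parser such as RemexHtml, not regexes):

```python
import re

def prefix_ids(html, prefix):
    """Prefix every id attribute and fragment link in one article's HTML
    so that merged articles cannot collide. Illustrative only."""
    html = re.sub(r'id="([^"]+)"',
                  lambda m: 'id="%s-%s"' % (prefix, m.group(1)), html)
    html = re.sub(r'href="#([^"]+)"',
                  lambda m: 'href="#%s-%s"' % (prefix, m.group(1)), html)
    return html
```

Passing separate files to wkhtmltopdf sidesteps this step entirely.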

Temporary HTML files created for the purpose of generating the PDF will be immediately removed from the file system as they won’t be needed for creating other PDFs (because it’s unlikely that other books will have the same structure as the one that’s been generated). The PDF however will be kept in the file system for some period so that we can serve it without re-rendering it. Every so often we’ll have to clean up old PDF files. How often?

The extension also exposes an end point for retrieving the render status of a collection. This end point will be used by Extension:Collection to periodically check whether the requested PDF is ready.
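
The polling contract could look roughly like this (the state names and the fetch callback are assumptions, not the actual endpoint spec):

```python
import time

def wait_for_render(fetch_status, collection_id, timeout=60, interval=2):
    """Poll the render-status endpoint until a terminal state is reached.
    fetch_status(collection_id) is assumed to return one of
    "pending", "rendering", "finished", "failed"."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = fetch_status(collection_id)
        if state in ("finished", "failed"):
            return state
        time.sleep(interval)  # back off between checks
    raise TimeoutError("render did not finish within %ss" % timeout)
```

Extension:Collection already polls its backends this way; only the payload shape would be new here.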

Open questions

  • What details should we clarify before working on the extension? @Tgr what do you think?
  • What problems may the above setup cause from the operations perspective? @faidon I'm curious to hear your opinion.
  • Other?

Related Objects

Event Timeline


Is there any relation between this task / wkhtmltopdf and Electron-PDFs/T150871? Or not, because this is about HTML to PDF?
(Asking as the PDF creation software stack is already confusing enough for average users...)

@Aklapper the proposed solution will replace the existing OCG service. I think this task can be filed as part of T150871.

ovasileva moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.

@bmansurov: Thanks. In that case, feel free to add as a subtask/parent task wherever appropriate (via "Edit Related Tasks...").

Several things that are still unclear and/or feel like bad ideas to me:

  • We currently have OCG deployed, which generates PDFs via LaTeX. We also have an experimental PDF service deployed that uses Electron (headless Chromium), supposedly to replace OCG. I'm not terribly familiar with wkhtmltopdf but it sounds like a new, third system?
  • Is this a MediaWiki extension which uses RESTBase to get a page's HTML? That sounds very odd, could you explain that choice a little bit?
  • Even if this is a good idea, transforming the page another time, in PHP this time, sounds odd as well. The number of different abstraction layers/languages/stacks that are going to manipulate the HTML content (and the formal boundaries between these layers) should be a consideration.
  • I don't understand how the extension "will also retrieve metadata such as authors of images from the MediaWiki API". Are you talking about an internal MediaWiki API or the Action API? This is from the MediaWiki extension, right?
  • Locally caching HTMLs or PDFs is practically impossible (and a bad approach in general). Among other concerns, this makes the MediaWiki workers stateful and requires an affinity of user->server (we operate hundreds of appservers and traffic gets to them on a round-robin basis).

In general, I think this would benefit from input from a broader audience and especially other backend developers (@Tgr is a good start :) and/or the ArchComm. It's also not the first time we've talked about all those concepts; OCG was a redesign of a previous system for example, and Electron was introduced just a few months ago; I feel like every time we talk about the broader space of PDF generation we start the conversation from the beginning and end up making poor design choices again (sometimes even the same ones).

A couple of comments:

  • I would recommend using the existing HTML to PDF functionality in our Electron render service install, rather than trying to use yet another tool (wkhtmltopdf) for this. We evaluated wkhtmltopdf, and it was inferior in every way. It is based on an old WebKit version, does not preserve links, and is scary for security.
  • To use electron, all you need to do is to expose the HTML you want to have rendered as a URL somewhere, essentially as an API. I would strongly suggest not using any local filesystem storage, for security and simplicity.
  • The merged HTML end point requires a bit of HTML DOM manipulation, and takes a list of domains / titles as an input. It could be implemented in a variety of ways, one of which would be as a new entry point in MCS.

Thanks, @faidon.

Several things that are still unclear and/or feel like bad ideas to me:

  • We currently have OCG deployed, which generates PDFs via LaTeX. We also have an experimental PDF service deployed that uses Electron (headless Chromium), supposedly to replace OCG. I'm not terribly familiar with wkhtmltopdf but it sounds like a new, third system?

Yes, wkhtmltopdf is a new system.

  • Is this a MediaWiki extension which uses RESTBase to get a page's HTML? That sounds very odd, could you explain that choice a little bit?

The main benefit is caching. We could hit the MediaWiki action API directly, but that does seem like a performance issue.

  • Even if this is a good idea, transforming the page another time, in PHP this time, sounds odd as well. The number of different abstraction layers/languages/stacks that are going to manipulate the HTML content (and the formal boundaries between these layers) should be a consideration.

Are you saying we should stick to node.js?

  • I don't understand how the extension "will also retrieve metadata such as authors of images from the MediaWiki API". Are you talking about an internal MediaWiki API or the Action API? This is from the MediaWiki extension, right?

Action API. And yes, from the MediaWiki extension.

  • Locally caching HTMLs or PDFs is practically impossible (and a bad approach in general). Among other concerns, this makes the MediaWiki workers stateful and requires an affinity of user->server (we operate hundreds of appservers and traffic gets to them on a round-robin basis).

It doesn't have to be local, I guess. Anything like Amazon S3 will do.

In general, I think this would benefit from input from a broader audience and especially other backend developers (@Tgr is a good start :) and/or the ArchComm. It's also not the first time we've talked about all those concepts; OCG was a redesign of a previous system for example, and Electron was introduced just a few months ago; I feel like every time we talk about the broader space of PDF generation we start the conversation from the beginning and end up making poor design choices again (sometimes even the same ones).

The reason we're doing this is that we need to sunset OCG, while keeping the functionality intact. The current OCG doesn't render tables, so we're going HTML->PDF route this time. We've also looked into electron, it doesn't satisfy some of the requirements such as generating the table of contents with page numbers.

Thanks for input, @GWicke.

A couple of comments:

  • I would recommend using the existing HTML to PDF functionality in our Electron render service install, rather than trying to use yet another tool (wkhtmltopdf) for this. We evaluated wkhtmltopdf, and it was inferior in every way. It is based on an old WebKit version, does not preserve links, and is scary for security.

One thing that Electron doesn't do is generate a table of contents with page numbers. We'll have to let Electron generate the PDF and then use another tool to generate the table of contents from that PDF. We'll also have to generate the PDF outline ourselves. This will get more complicated, as I'm not sure whether we'll have to update internal links in the PDF once we merge multiple PDFs into one. That's the reason why we chose wkhtmltopdf. Could you link to your evaluation? I'd like to know more.

  • To use electron, all you need to do is to expose the HTML you want to have rendered as a URL somewhere, essentially as an API. I would strongly suggest not using any local filesystem storage, for security and simplicity.

I'm open to using any other storage. What would you suggest?

  • The merged HTML end point requires a bit of HTML DOM manipulation, and takes a list of domains / titles as an input. It could be implemented in a variety of ways, one of which would be as a new entry point in MCS.

This is also the proposed solution. The list of domains and titles will be inside the metabook linked in the description.

Thanks for input, @GWicke.

A couple of comments:

  • I would recommend using the existing HTML to PDF functionality in our Electron render service install, rather than trying to use yet another tool (wkhtmltopdf) for this. We evaluated wkhtmltopdf, and it was inferior in every way. It is based on an old WebKit version, does not preserve links, and is scary for security.

One thing that Electron doesn't do is generate a table of contents with page numbers. We'll have to let Electron generate the PDF and then use another tool to generate the table of contents from that PDF.

Electron can support an inline TOC with links to sections, but you are right that it does not support a PDF TOC with PDF page numbers. I assume that you are interested in printed PDFs, in which case clickable links won't work? If the main use is screens, then clickable links might be good enough, and they will work in Electron.

We'll also have to generate the PDF outline ourselves. This will get more complicated as I'm not sure if we'll have to update internal links in the PDF once we merge multiple PDFs into one.

I was assuming that you'd feed a single HTML file containing all articles to the render service, rather than concatenating PDFs.

That's the reason why we chose wkhtmltopdf. Could you link to your evaluation? I'd like to know more.

See T134205.

  • To use electron, all you need to do is to expose the HTML you want to have rendered as a URL somewhere, essentially as an API. I would strongly suggest not using any local filesystem storage, for security and simplicity.

I'm open to using any other storage. What would you suggest?

For single article PDFs, we simply cache the output in Varnish. The service itself is completely stateless. Electron is fairly fast, and collection use is rare. For example, the service renders a 440-page wikibook in ~36s (on a small labs instance): T142226#2537844

Electron can support an inline TOC with links to sections, but you are right that it does not support a PDF TOC with PDF page numbers. I assume that you are interested in printed PDFs, in which case clickable links won't work? If the main use is screens, then clickable links might be good enough, and they will work in Electron.

Yes, we want printable PDFs. I've updated the description to include this information.

I was assuming that you'd feed a single HTML file containing all articles to the render service, rather than concatenating PDFs.

If we use electron-render-service, then even when we feed it a single HTML file, we'll still have to generate another PDF with the table of contents and then concatenate that with the one generated by electron-render-service. With wkhtmltopdf the desired PDF is generated with a single command.

See T134205.

Thanks. I'll read through that task and update this one as needed. I'm also eager to learn about security issues that you mentioned.

For single article PDFs, we simply cache the output in Varnish. The service itself is completely stateless. Electron is fairly fast, and collection use is rare. For example, the service renders a 440-page wikibook in ~36s (on a small labs instance): T142226#2537844

Using wkhtmltopdf, I've generated a roughly 700-page PDF on my development machine in about 30 seconds. Although that doesn't tell us much about the performance of wkhtmltopdf, I felt it was performant enough given the extra work it does compared to electron-render-service (like generating the table of contents, or the ability to execute JS). If performance is a concern, I can do more tests.

Some feedback from the services meeting today:

@GWicke is concerned about the complexity of the solution. Is it a product decision that all these features need to be here? Are there any features we can throw away in the migration? The more we need to support, the more maintenance work there is going forward. @ovasileva: is anything negotiable? According to @pmiazga, the table of contents is the biggest concern here (see https://phabricator.wikimedia.org/T166188#3289727); it is not possible with Electron.

There were concerns raised in the group that we might end up maintaining two different things (e.g. Electron and wkhtmltopdf) and that we should be talking more to WMDE to make sure we avoid this scenario.

@Jdlrobson - a table of contents with page numbers is one of the strict product requirements for books. If we decide to go with wkhtmltopdf, I think it should be the solution for both single articles and books. I've asked @bmansurov and @pmiazga to begin by looking at single articles with wkhtmltopdf so we can evaluate the use cases for WMDE (namely tables). If we confirm the ability to print tables with wkhtmltopdf, we should reach out to WMDE and talk about doing a full swap from Electron to wkhtmltopdf. @Jdlrobson, @GWicke - is the main concern maintaining two services at a time, or the complexity of wkhtmltopdf itself?

is the main concern maintaining two services at a time or the complexity of wkhtmltopdf itself?

The former. Having two solutions for converting <insert-format-here> into PDF is not sustainable and creates significant maintenance overhead. Ideally, we want one solution that satisfies both needs. I have already suggested that it would be a good idea to merge your requirements and those of WMDE so that alternatives can be evaluated fully and against all requirements.

As to wkhtmltopdf, it should be noted that it is a command-line tool, and for security reasons shelling out to it will most likely not be allowed in production. That is to say that if this tool is chosen, it has to be converted into a service.

I don't understand why the conversation has focused on inline TOCs and whatever other little detail. The broader issue is that the Reading team is coming up with a third PDF renderer while (their) plan to replace the first one with the second one is well underway and is proposing that with unclear requirements, arguments and transitioning plans. This is all pretty chaotic and looks very disorganized to me.

I'm sorry for the bluntness, but this is how it looks from our side: the Reading team has refused to take ownership of OCG for over a year now (~April 2016), despite repeated lengthy conversations about it, while leaving it orphaned to this day, running and serving users in prod and causing problems and unreasonable time expenditures from ops to keep it on life support. It did not write or maintain Electron-the-service but relied on the Services team doing so (despite that not really being their core mandate). It promised to at least own the product side of it and worked on transitioning its use cases to it (cf. T150871) in a 9-month timeframe -- one that we're in the middle of and that we recently learned is going to be behind schedule. The cherry on top is that ops, Reading PM (Olga), Reading engineering and Reading/Product leadership met a couple of weeks ago in Vienna to talk about the progress of that OCG->Electron migration and this whole new "rendering backend" was not mentioned at all.

The even crazier part is that if this design and service get implemented, it will be the fourth system to generate PDFs out of articles or collections of articles in the past three years, with every one of those three transitions causing a lot of lengthy conversations, frustration and headaches for everyone involved. Is it really that difficult of a problem to be wasting so much of our time on?

More forward-looking: if the Reading team wants to use wkhtmltopdf for PDF rendering for various reasons, that's obviously your prerogative (and everyone is free to argue on the merits of your arguments, obviously, as some people have been doing already). However, you'll need to either decommission Electron or argue that we should incur the (foundation-wide) cost of maintaining both of these renderers for some reason. You'll also need to come up with or adjust your migration plans (Electron->wkhtmltopdf? OCG->wkhtmltopdf?) and timelines, communicate them to us, and coordinate them with us (ops & services). Depending on how these look, we may have some input and they may need to be altered -- I can tell you already that having three renderers overlapping in production is going to be quite a hard sell. The same applies to the extension of the OCG sunsetting deadline, as we already discussed a couple of weeks ago.

Finally on a semi-separate note: could we please stop talking about "WMDE's" requirements? Electron was deployed as an OCG replacement with quite a bit of involvement from Reading, after a series of conversations on Phabricator, email and meetings. Reading is supposed to own the product side of it -- and @ovasileva has actually been doing that and doing it well as far as I can tell. There are no separate "WMDE" requirements, these are our collective requirements that we all devised together months ago.

Several things that are still unclear and/or feel as bad ideas to me:

  • We currently have OCG deployed, which generates PDFs via LaTeX. We also have an experimental PDF service deployed that uses Electron (headless Chromium), supposedly to replace OCG. I'm not terribly familiar with wkhtmltopdf but it sounds like a new, third system?
  • Is this a MediaWiki extension which uses RESTBase to get a page's HTML? That sounds very odd, could you explain that choice a little bit?
  • Even if this is a good idea, transforming the page another time, in PHP this time, sounds odd as well. The number of different abstraction layers/languages/stacks that are going to manipulate the HTML content (and the formal boundaries between these layers) should be a consideration.
  • I don't understand how the extension "will also retrieve metadata such as authors of images from the MediaWiki API". Are you talking about an internal MediaWiki API or the Action API? This is from the MediaWiki extension, right?
  • Locally caching HTMLs or PDFs is practically impossible (and a bad approach in general). Among other concerns, this makes the MediaWiki workers stateful and requires an affinity of user->server (we operate hundreds of appservers and traffic gets to them on a round-robin basis).

In general, I think this would benefit for input from a broader audience and especially other backend developers (@Tgr is a good start :) and/or the ArchComm. It's also not the first time we've talked about all those concepts; OCG was a redesign of a previous system for example, and Electron was introduced just a few months ago; I feel like every time we talk about the broader space of PDF generation we start the conversation from the beginning and end up making poor design choices again (sometimes even the same ones).

Apologies for not responding sooner. To me, wkhtmltopdf does not seem architecturally different enough from Electron to make the switch a big deal. Both would be stateless REST APIs. Both would run as a node webservice, listen for incoming connections, receive the HTML through such a connection, pass it to some browser-based binary, get back a PDF file, and send it back. For Electron that node service already exists; for wkhtmltopdf it would have to be written, which is extra work, but it's not that complicated. There are reasons to favor Electron, but they are minor, and if TOC page numbers are a hard requirement and we cannot create them in HTML (worth a try, but I did not see a robust way to do it), then that's how it is.

We shouldn't maintain two different HTML-to-PDF pipelines, of course, so wkhtmltopdf would replace electron in this case.

As for multiple layers of abstraction, that seems like a good thing to me. Electron is now conceptually super simple: you pass in an HTML page, you get back a PDF rendering. It does not have to know anything about MediaWiki or Wikipedia. That's a nice property we might want to keep; at the same time, we need something that receives a list of page names and creates a combined HTML page that can then be turned into a PDF. In OCG that was the bundler; with Electron/wkhtmltopdf, doing it in the node.js service or in the PHP extension both seem like reasonable options. The PHP DOM library is in-house and technically superior, although that probably does not make a big difference.
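
A minimal sketch of such a bundler step, with the fetch function injected (in production it would hit RESTBase; all names here are illustrative, not an actual API):

```python
def build_combined_html(titles, fetch_html, book_title="Collection"):
    """Combine per-article HTML into one document for the renderer.
    fetch_html(title) is assumed to return that article's body HTML."""
    parts = ["<html><head><title>%s</title></head><body>" % book_title]
    for i, title in enumerate(titles):
        # wrap each article in its own section so the renderer can
        # break pages between articles and build a TOC from the headings
        parts.append('<section id="article-%d"><h1>%s</h1>%s</section>'
                     % (i, title, fetch_html(title)))
    parts.append("</body></html>")
    return "".join(parts)
```

Whether this lives in the node.js service or the PHP extension is exactly the layering question discussed above; the logic is the same either way.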

A note from the product perspective. We have a separate but similar set of requirements for single-article PDFs and for books, documented here: https://www.mediawiki.org/wiki/Reading/Web/PDF_Functionality#Proposal. Currently, the blocker for using the Electron service is providing the table of contents, which is of high priority for the books feature. I agree that there are no specific WMDE requirements; however, the original motivation behind choosing Electron was based on the ability to print tables. Prior to selecting wkhtmltopdf, we would need to ensure its ability to render tables (T167117: [Spike 6hrs] Investigate ability of wkhtmltopdf to render single articles) and evaluate the difficulty of supporting a TOC in Electron (T167210: [EPIC] Adding PDF TOC with PDF page numbers to electron). I just want to highlight that we are not discussing supporting both options but selecting the one which satisfies our high-priority functional requirements.

For what it's worth, Headless Chromium has now reached stable (Electron is sometimes called "Headless Chromium" as it had a headless mode before standalone Chrome did, but I'm referring to actual Headless Chromium, meaning a plain chromium --headless invocation from the command-line with no additional packages or wrappings).

One of the built-in features in Chromium's Headless mode is the ability to print-to-pdf any page.

It was featured in this week's release notes for Chrome 59.
https://developers.google.com/web/updates/2017/05/nic59

Headless Chrome

[..] For example:

  • Using Selenium for unit tests against your progressive web app
  • To create a PDF of a Wikipedia page
  • Inspecting a page with DevTools

[..] It brings all modern web platform features provided by Chrome to the command line.


We had it on the radar as a promising future alternative in our last evaluation round (see T134205#2359999), so it's nice to see this be ready now. There is a decent description at https://developers.google.com/web/updates/2017/04/headless-chrome. Example commandline: chrome --headless --disable-gpu --print-to-pdf https://en.wikipedia.org/wiki/Foobar

At a high level, it sounds like we are all on the same page about maintaining only a single browser-based HTML-to-PDF render service going forward. I personally don't care much about which one it is, as long as it is reasonable along a few criteria. These are the criteria I care about:

  • Security: What are the security properties of the solution? Are there sandboxing / isolation mechanisms? Are there going to be timely security updates going forward?
  • Robustness: Any risks that could cause significant ongoing maintenance / operations effort?
  • Features: Extra features like PDF TOC / collection support.
  • Tracking latest browser tech: Browser print CSS support is still relatively limited, which impacts rendering quality. Browsers are continuing to improve on this front, so it would be good if the solution was closely tracking the development of a major browser, or is easily swappable against a solution that does.

Feature-wise wkhtmltopdf seems to be the clear winner right now (although that might change as CSS3 Paged Media browser support picks up, since all the other solutions use significantly more recent browsers).

In every other aspect it trails far behind:

  • security: wkhtmltopdf is written in C++ where common security bugs like buffer overflow can be quite severe. (Of course the browser itself is written in C++ in all cases but the Google security team is way more reliable than the wkhtmltopdf one.)
    • also, Electron has some level of sandboxing (since the rendering thread is managed by a Node thread); wkhtmltopdf does not, as far as I understand (which is not very far, though).
  • robustness: debugging C++ code, eek.
  • up-to-date-ness: Electron uses Chrome 56 (released at the end of January). wkhtmltopdf uses QtWebKit 5.2 (2013-ish). Also, their upgrade process seems messier in general, partly because there are two steps (both the QtWebKit and the wkhtmltopdf projects need to keep up to date, and neither did a good job in the past), and partly because they do not embed the browser so much as cherry-pick a random set of patches, so they basically have their own WebKit version. (Again, scary security/maintainability issues...) Electron does that too, but to a lesser extent. (And the headless Chrome option mentioned by @Krinkle is of course the actual browser, as fresh as we want it to be.)
    • also, that means that with Electron/Chrome it's easy to accurately debug issues in an actual browser; with wkhtmltopdf, not so much. That's a pretty big difference for designers IMO.

So there would have to be very strong functionality arguments to pick wkhtmltopdf and IMO those do not exist. Both PDF parsing and the vivlio stuff @GWicke mentioned in T167210#3335691 seem like workable options to add a TOC. And an outline seems relatively unimportant if the TOC is clickable anyway.

Thanks @Tgr, that's a helpful comparison :)

Since Chrome 59 is already released and generating PDFs with it is trivial (chromium --disable-gpu --headless --print-to-pdf=obama.pdf https://en.wikipedia.org/wiki/Barack_Obama), plus chrome-launcher was apparently even split off into its own NPM module, what's your opinion of using just a stock Chromium instead of Electron?

(Note that jessie doesn't have Chromium 59 yet, but that's a matter of days probably, as Debian follows upstream Chromium for security updates.)

I agree with the idea of evaluating headless Chromium as well, now that the feature is no longer experimental. @ovasileva @bmansurov are there plans to do so?

@mobrovac, I don't think we're planning on evaluating headless Chromium by itself as it doesn't satisfy the product requirements. However, there is a spike about vivliostyle (T168004) that uses Chrome internally (I think). Depending on the outcome, I think we may use Electron rather than wkhtmltopdf.

Headless Chrome vs. Electron seems like a boring comparison from a product POV, since it's almost the same thing. Headless Chrome is slightly more up to date (59 vs. 56 in Electron), but that's not a huge difference. It might be worth evaluating whether Electron should be replaced with it, but that's more of a devops issue and orthogonal to the OCG replacement project (so as long as that one is not done, I wouldn't spend time on it).

Maintenance-wise, memory usage under heavy load might be the interesting question as Electron would fork a new Chromium rendering thread for each request and chrome-launcher would fork a full Chromium instance every time.

@bmansurov vivliostyle has a bundled version called Vivliostyle Formatter that includes a headless Chrome (or Electron, not sure). It's not open source, though.

Headless Chrome vs. Electron seems like a boring comparison from a product POV IMO since it's almost the same thing.

Completely agree. From a product perspective, the comparison is between client-side TOC generation & pagination vs. a server side only solution like wkhtmltopdf.

On the operational side, headless Chrome looks like an improvement over Electron on two counts:

  • It eliminates the need for a virtual X server and xpra, which should hopefully resolve the race conditions we have seen especially during Electron restarts (see T159922).
  • Resource management at least in the latest version of Electron is not ideal, with Electron sometimes consuming a lot of memory, and also sometimes entering a hanging state. In comparison, a process is relatively easy to limit in memory & maximum run time.

Baseline latency of printing a locally hosted HTML page using headless Chrome is 140ms on my machine, which is relatively minor considering that median print times using Electron are around 1s.

On the operational side, headless Chrome looks like an improvement over Electron on two counts:

These, plus, security-wise, relying on Google Chrome is a much better bet than Electron: Electron being three Chrome versions behind isn't particularly encouraging.

But OK, I hear you both that the comparison in the dimensions that you mentioned makes more sense in this iteration. Fair enough!

Electron has caused a couple of outages lately by eating up too much memory, presumably due to the increased load since it was enabled on all wikis a couple of weeks ago. Load will increase a lot more once OCG is disabled / discouraged as a rendering option; should we expect a problem there? Or will that take care of itself as OCG decommissioning frees up resources?

@Tgr, Electron is now limited to 2G of memory in the systemd config, so should not affect other services. The main remaining concern is about it sometimes hanging. @mobrovac has created a check / automatic restart patch, see T159922 for the details.
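The systemd limit mentioned above would look roughly like this in the service's unit file (a sketch; the actual unit for Electron may differ):

```ini
[Service]
# Cap the service's memory usage via the cgroup controller
MemoryLimit=2G
```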

So to summarize, I think we can get by for the time being, but need to address the underlying reliability issue in the medium term.

Re: concatenation with RemexHTML, I put up a proof-of-concept patch at https://gerrit.wikimedia.org/r/#/c/361453/

@Tgr, Electron is now limited to 2G of memory in the systemd config, so should not affect other services. The main remaining concern is about it sometimes hanging. @mobrovac has created a check / automatic restart patch, see T159922 for the details.

So to summarize, I think we can get by for the time being, but need to address the underlying reliability issue in the medium term.

I would call it more of a short-term than a medium-term issue. Automatic restart is not really a fix, but a workaround, which was done because @mobrovac wasn't sure if it was worth investing more time into Electron while there is this whole discussion about replacing it.

Honestly, I'm still confused a little bit about next steps :) Have we decided on Electron vs. wkhtmltopdf? If yes, can we fix Electron? If not, I'm not sure why we turned all wikis on, and presumably we won't be turning off OCG until we do, right?

Have we decided on Electron vs. wkhtmltopdf?

We haven't.

There are four possible approaches:

  • use wkhtmltopdf, which can add page numbers and generate a table of contents with page numbers from h1..h6 tags, but is not great from a maintenance POV (as mentioned in T166188#3351769)
  • use Electron (or headless Chrome) and load some JavaScript library in the HTML page to be rendered that adds a TOC and page numbers. vivliostyle seems to be the only mature solution for that (there is also fiduswriter/pagination.js, but that's WebKit-only, and fiduswriter/paginate-for-print, which I haven't tested yet, but from the code it does not look convincing), and while it can add page/TOC numbers, it garbles the site CSS too much to be useful. (See T168004 for details.) I'll contact the developer and see if they can suggest workarounds, but so far it does not look like a viable solution.
  • use Electron to render a PDF without page/TOC numbers and then post-process it to add those somehow. I'm looking into that now (the task is T168871); it seems doable in theory, but at least in PHP the library support sucks hard, so I'm not sure yet how ugly the stack will end up being.
  • use Electron with standard CSS for page numbers and TOC numbers. Unfortunately, no such CSS is supported by Chrome, so we obviously cannot do this now, but it is important for our long-term planning: if this happens in, say, a year, there is no point in putting too much effort into making the current OCG replacement very robust. We might want to reach out to get a better understanding of browser roadmaps / hopefully motivate vendors. The standards to watch are CSS Paged Media Module Level 3 (specifically the content property in page margin boxes and counters in @page, for page numbers) and CSS Generated Content for Paged Media Module (target-counter(), for TOC numbers). Alternatively, maybe CSS Regions.
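For reference, the standard CSS that the last option would rely on (unsupported by Chrome at the time of writing) looks roughly like this; the selectors and class names are illustrative:

```css
/* Page number in the bottom-center margin box
   (CSS Paged Media Module Level 3) */
@page {
  @bottom-center {
    content: counter(page);
  }
}

/* Each TOC entry picks up the page number of its link target
   (CSS Generated Content for Paged Media, target-counter()) */
.toc a::after {
  content: target-counter(attr(href url), page);
}
```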

If not, I'm not sure why we turned all wikis on

It helps to compress the timeline for usability testing, I imagine.

From my point of view, I think it's clear that we need to resolve the stability issues with Electron. If wkhtmltopdf were not under discussion, I would propose seriously looking into a migration to Chrome's headless mode now. In any case, there are better browser-based options now, and we will replace Electron with one of them.

There are four possible approaches:

Quick update on that:

  • the problems with vivliostyle were mostly caused by the inflexibility of its CSS parser (upon encountering invalid CSS, it discards the rest of the stylesheet). There were some other errors that would also be dealbreakers for us. They might not be that hard for the vivliostyle devs to fix, and they were interested in at least discussing the possibility, so we'll see how that goes. Other than that, it does a good job, but that's a pretty big "other than". (There is also the AGPL issue; it's unclear at this point whether that license would be problematic for us.)
  • PDF post-processing works but results in a much uglier pipeline (user sends request to MediaWiki -> MediaWiki concatenates the document and posts it to Electron -> Electron sends the PDF back to MediaWiki -> MediaWiki post-processes the PDF in memory and sends the data to the user), and the PHP code is not great maintenance-wise. (The post-processing could also be done in the Node service or by shelling out; not sure if that would be much of an improvement.)
  • T169897 is the task for tracking browser support for a pure-browser solution. GCPM is still a work in progress, so I wouldn't expect support anytime soon.
Aklapper lowered the priority of this task from High to Low. Dec 8 2020, 8:05 AM

Three years later, is this still on the list and should it remain open? Or is Collection's relation to Proton fine?
This doesn't look like high priority either.

ovasileva added a subscriber: sdkim.

I think we can go ahead and decline this. @sdkim - feel free to open if you think it might still be relevant.