Page MenuHomePhabricator

[Spike - 8 hrs] Where should article concatenation be implemented?
Closed, ResolvedPublic

Description

We'd like to take a list of articles in the metabook format and generate a single HTML file. The HTML file will be converted to a PDF file using the ElectronRenderService.

Outcomes

  • A decision is made as to where article concatenation is implemented. The options are:
    • Extension:Collection
      • A proof-of-concept patch for Extension:Collection has been done as part of adding page numbers to the PDF table of contents.
    • A new service.

@phuedx: See T171964#3552933.

Related Objects

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 28 2017, 3:53 PM
ovasileva renamed this task from Concatenation of list of articles to [Spike] Concatenation of list of articles.Aug 8 2017, 4:52 PM
ovasileva triaged this task as High priority.Aug 8 2017, 4:53 PM
bmansurov updated the task description. (Show Details)Aug 9 2017, 1:39 PM
bmansurov renamed this task from [Spike] Concatenation of list of articles to [Spike - 8 hours] Concatenation of list of articles.Aug 9 2017, 5:26 PM
bmansurov claimed this task.Aug 9 2017, 6:16 PM
bmansurov updated the task description. (Show Details)

Here's my attempt at answering the questions above. A lot of the decisions depend on choosing the right architecture. Services and Ops teams are best suited to recommend one way or the other.

Kinds of concatenation is needed

We identified the changes needed for concatenation in T163272: [Spike] Determine changes necessary for concatenation support . Here is the modified list (as the original spike was about wkhtmltopdf):

  • Create a cover page in HTML.
  • Retrieve articles in HTML format, and lay them out in the hierarchy requested in requrested metabook. For each article:
    • Make sure original articles do not contain the table of contents, or else remove the table of contents.
    • Create a title with the chapter number, e.g. "1. Apple"
    • Prefix section titles with the chapter number and section number, e.g. "1.1. Botanical Information"
    • Since articles can be grouped into chapters on /wiki/Special:Book, we need to make each article a subsection of a chapter if the article is a part of a group. For example, if I'm interested in creating a book about fruits and vegetables, I may have two chapters called "1. Fruits" and "2. Vegetables". The article "Apple", would go under "1. Fruits" and be titled as "1.1. Apple". Sections of the article would be prefixed with "1.1.1.", "1.1.2", etc.
    • We may also have to '"push down" headings when the page has a =-level section' but this case is rare.
    • Make sure ID's are unique
    • Change references links to point to the references on the page as opposed to the references in the source URL.
  • Retrieve Contributors, images (https://en.wikipedia.org/w/api.php?action=query&titles=File%3ABook_Collage.png&prop=imageinfo&iiprop=url|size|mediatype|mime|sha1|extmetadata), and license info from the MW API endpoint and create an HTML page using them.
  • Put the cover page, articles, contributors page, and the license info together in one HTML file.

Extension:Collection

Why do it here?

  • A proof-of-concept patch already exists. Some modifications will be needed though.
  • It's a logical location for such a feature. The UI for generating books lives in this extension. It maybe a good idea to include the generation of concatenated HTML in the extension as well.
  • No need to create a separate service. Drop some code in the extension and we're ready to go.
  • The HTML parser (RemexHtml) is controlled by us, so it maybe easy to adjust it to fit our needs.

Why not do it here?

  • We don't have too many options when it comes to manipulating the DOM. RemexHml seems to be the only acceptable choice. I'm not sure how performant it is compared to other solutions.
  • RemexHtml is only used by Wikimedia projects (as far as I know), which means it may contain some serious uncovered bugs or not.
  • Depending on how Extension:Collection is setup, it maybe hard to horizontally scale the servers because we'd have to scale MediaWiki instances first as the extension lives inside MediaWiki. On the contrary, a single micro-service maybe easier to scale.

A micro service

Presumably we'll be using Node.js as it's been used across the foundation in various projects and we don't want to introduce yet another stack.

Why do it here?

  • A lot of libraries to choose from for DOM manipulation. Some examples are JSDom, cheerio, etc.
  • Since the majority of Readers developers are more familiar with our JavaScript stack, we won't be stuck when we need help with maintaining the service or improving it. If we go with something like cheerio, it's basically jQuery in the server, and anyone can dive right in the codebase and start contributing.
  • Maybe easier to scale than the monolith solution.
  • In the long-run, extensions other than Collection, or other services can use this service as the service does one thing only and will be designed to be consumed by outside services.

Why not do it here?

  • Initial and continued cost of maintenance of the service (from the ops perspective).
  • A special care maybe needed when exposing the service to the outside world. We may want to keep it private and make it accessible by Extension:Collection only for now. I'not sure how much work that is from the operations point of view.

Traditional unix tools

Why not create a shell script and use tools like tr, grep, and wget?

Why do it this way?

  • Arguably, we'll write fewer lines of code compared to the solutions above. This translates into fewer bugs, and less maintenance work.
  • Each tool does one job and does it well.
  • Reliable. The tools have been around so long that, we can virtually rely on them as being bug free. (We should look for bugs in the shell script that we'll be writing.)
  • More performant because these tools are written in a lower language than PHP and JavaScript.

Why not do it this way?

  • It's not clear where the shell script would live. We'd still need to talk to it from either Extension:Collection, or a Node.js server, or somewhere else.
  • I'm not sure how comfortable our developers feel while working with these tools.
  • Yet another set of tools.

Other considerations

  • In order to make the best choice we may also want to compare various solutions to test memory consumption, cpu usage, execution speed among other things. Is RemexHtml more performant than JSDom or cheerio? Or is performance our main goal here?

@Tgr, @GWicke I'd appreciate your views on the above comment. Thank you.

phuedx updated the task description. (Show Details)Aug 10 2017, 2:18 PM
phuedx renamed this task from [Spike - 8 hours] Concatenation of list of articles to [Spike - 8 hours] Where should article concatenation be implement?.Aug 10 2017, 2:21 PM
phuedx renamed this task from [Spike - 8 hours] Where should article concatenation be implement? to [Spike - 8 hours] Where should article concatenation be implemented?.
phuedx updated the task description. (Show Details)
phuedx renamed this task from [Spike - 8 hours] Where should article concatenation be implemented? to [Spike - 8 hrs] Where should article concatenation be implemented?.Aug 10 2017, 2:27 PM
Tgr added a comment.Aug 10 2017, 6:16 PM
  • chapter/section number is much easier to do via CSS counters than by rewriting the HTML.
  • I don't see the potential benefit in using unix tools. You would have to use something pretty complex like awk, which would result in code most developers can't easily read, and the interaction with the usual development ecosystem (unit tests, logging, debugging tools etc) would be awkward. Not to mention that it would limit the extension to Linux installs (and only specific distributions, unless you pay a lot of attention to portability).
  • I doubt performance is a big deal since PDF rendering will probably take much more time than the simple HTML changes that are proposed here; there is little value in trying to optimize parts of the system which are already relatively fast. That said, RemexHTML has better asymptotic performance than the alternatives (see Tim's comment in T163272#3272877) and probably less overhead as well since the delays inherent in communicating via HTTP are very likely going to be larger than any speed benefit a Node implementation might possibly have over a PHP implementation in object instantiation time and whatnot.

(Also, personally I think Reading Web developers staying unfamiliar with the PHP stack is not a reasonable thing to plan for in the medium term. That lack of familiarity has harmed Reading Web projects in the past and will continue to do so unless addressed.)

phuedx added a subscriber: phuedx.Aug 14 2017, 5:19 PM

tldr: Choice seems to really hinge on who will be looking after this after OCG has been sunset.

chapter/section number is much easier to do via CSS counters than by rewriting the HTML.

Definitely. Support is also pretty good (although wouldn't work in IE6 and 7, but that's fine). This can be done in both a service or php layer and would expect this to come up during implementation.

I don't see the potential benefit in using unix tools.

Agreed. I would rule that out as an option.

I doubt performance is a big deal since PDF rendering will probably take much more time than the simple HTML changes that are proposed here;

Also agree, but we'll probably want to put an upper limit on the amount of pages that can be concatenated, otherwise this could be a vector for attacks regardless of whether done in PHP or service.

(Also, personally I think Reading Web developers staying unfamiliar with the PHP stack is not a reasonable thing to plan for in the medium term. That lack of familiarity has harmed Reading Web projects in the past and will continue to do so unless addressed.)

Given the size of the team and the amount of products we do maintain (including those dumped on us), that we do require knowledge of the PHP stack for I find this quite offensive (and a little personal) to be honest. I'm happy to talk about it outside this phabricator task with management if it would be useful, but I don't think this is the place.

A micro service

Given reading web's projects are becoming more and more Node dependent and services based this is a valid thing to consider. Which team will be maintaining this on the longer term? If reading web, it's going to get a much higher level of support if the code is also services based given that is our Toby and Adam approved strategic direction.

(Also, personally I think Reading Web developers staying unfamiliar with the PHP stack is not a reasonable thing to plan for in the medium term. That lack of familiarity has harmed Reading Web projects in the past and will continue to do so unless addressed.)

I'll admit that I had a hard time unpacking this comment. The more general point that Readers Web engineers must be familiar with the PHP stack is valid, but I disagree with you that they are unfamiliar with it.

Our medium- and long-term goals all hinge on consuming MediaWiki APIs, some of which RW are on the hook for maintaining. I expect Readers Web to be contributing to whatever stack underlies these APIs whenever the work requires it.

If you haven't already, then please do reach out to @dr0ptp4kt and me with any concerns that you might have and we'll do our best to address them. I think dissecting the outcomes of previous Readers Web projects is way, way out of scope for this task.

tldr: Choice seems to really hinge on who will be looking after this after OCG has been sunset.
...

A micro service

Given reading web's projects are becoming more and more Node dependent and services based this is a valid thing to consider. Which team will be maintaining this on the longer term? If reading web, it's going to get a much higher level of support if the code is also services based given that is our Toby and Adam approved strategic direction.

That the question as to who maintains this in the long term is open should point us towards maximizing the number of folks (Foundation- and Community-wide) who could help to maintain it IMO. Homing the majority of this system in the PHP stack achieves this. Regardless, we need to be clear about what support we're being expected to provide and what support we can feasibly provide, which is something to @ovasileva and me to consider.

Also, writing the majority of this system in PHP would also leave open the opportunity to integrate the WSExport tool, which powers the "Download as EPUB" link on https://wikisource.org pages.

@bmansurov thank you for diving deep into different stacks to find the best solution that both solves the problem we're facing and meets our requirements as developers.

existing proposals

Traditional Unix tools

As you said, Unix tools are powerful, they are reliable, fast (mostly written in C) and they may be suitable for the current problem. In long-term it will be difficult to maintain/extend the functionality as bash/awk scripts are pretty difficult to work with. As different people said this is off the table

A micro service

Sounds reasonable. I like the idea of being possible to scale. With every month we develop more and more stuff in Node and it works pretty well. I'm worried about two things

  • choosing the proper library for HTML handling. JS is pretty famous for having thousands of different libs doing the same thing. Also almost every day there is a new library that comes out and does something better. jQuery is pretty slow, DOM traversal is in general something that costs a lot. Some basic changes can be done with preg_replace() and it will be much faster with lower memory footprint.
  • Node provides a great environment for services which require some state. HTML concatenation is stateless as it doesn't need any information from the previous request and current request will not affect any future requests. Once service receives metabook file it has to download all files from textextracts, merge it and output HTML. There is almost nothing it could reuse between different requests.

I'm also concerned about node being async and the fact that we might have to retrieve 200 articles. Sending all requests asynchronously at the same time might be a weak link. Some requests might fail, what do we do? Probably it's better to send them one after another. Even it takes a bit more time it's safer and we will be able to recover quickly.

Extension:Collection

This sounds good as most of our stuff is written in PHP, and this task (as I mentioned in micro service section) is a perfect candidate for PHP request. It comes in, HTML gets produced and returned. No session/state handling, every request is a unique/standalone job. Using RemexHTML might be useful as it's developed by us, there might be bugs but we're responsible for fixing those, if not there is DOMDocument which does what it needs to do.

I'm just not sure if using Extension:Collection is a proper place as this extension is pretty old (latest version is over 3 years old). While testing it Anthony found many issues (https://phabricator.wikimedia.org/T166565#3320665). Before we start working on Collection extension it will require some refactoring/updating to latest standards. I would prefer to keep Extension:Collection as a UI only as it is mostly a skin Extension. We can re-use same backend for different Skins (Vector, MinervaNueue, Timeless, Cologne) but user-facing stuff might have to be created separately per each skin.

If we decide to add a possibility to create books in Minerva it will require some changes in existing extension (which may break the desktop version). Making a HTML concatenation service as a standalone service/different extension looks more reasonable to me.

Standalone PHP service

It can be a MW extension if we're going to reuse MW stuff, but I don't see a valid reason why we need MediaWiki Core for that.

To me, this is a standalone service that

  • receives a file in metabook format
  • verify in cache that it did not create the same book in last couple hours
  • retrieves data from MediaWiki/RestBase/MCS
  • has a set of rules/formatters which are applied to the HTML (not one big chunk of code, set of classes that modify HTML step by step)
  • applies print styles
  • output HTML or can use post processing to
    • output PDF calling electron directly
    • output EPUB using WSExport

We would be able to scale this PHP service same as Node service (every request is a new thing, using cache will not help, only spinning new instances of PHP servers or giving more resources to the existing server). Once HTML building is done PHP all memory consumed in given request can be easily freed. The only async thing we can do is downloading data from MW/RB/MCS so it's probably not a big deal - crunching data/traversing dom will take much more time than sending couple calls to the Varnish.

Thanks everyone for a review. I'll highlight a few points that were brought up (please read comments above if you're interested in the details):

  • Using Unix tools is the least desired option out of the other options suggested.
  • We shouldn't worry about performance just yet, as the bottleneck is largely at the render phase.
  • There's a concern about setting an uppre limit of pages to be rendered. (Although I think it's a valid concern, it should be tackled separately, possibly by limiting the number of articles that can be printed to PDF).
  • Doing in PHP also makes sense because other features such as "Download as EPUB" are also written in PHP and integrating them in the future will be easier.
  • Standardization is an issue in Node-land.
  • Node may not be best suited for the job.
  • Extension:Collection codebase has been stale for sometime and may require us to modernize it (by refactoring, etc.) and first before being able to contribute.
  • We may have a hard time getting Extension:Collection to work in MobileFrontend
  • Another approach (besides the three suggested at T171964#3514554) is to create a new PHP service.

Having summarized the above points I'll be bold and suggest that choosing between PHP and Javascript is mostly a convenience issue, it seems. It's a question of whether we want to maintain a new PHP service, or Extension:Collection, or a Node.js serivice and how much up-front work it requires. I personally like @pmiazga 's suggestion with some modifications.

Here's how I see the issue solved.

One way or another we have to create a script that does post-processing as described in T171960. The choice of the script should decide what we do here. @Tgr mentioned some issues with PHP libraries for PDF processing in T168871#3419245. Node.js has PDFKit but I'm not sure how well it suits our needs. We'll need to do a spike and see. On the other hand we've demonstrated that the post-processing can be done using a Python library in T168871#3449625. So, rather than creating a new PHP service, I suggest we bundle concatenation support with post-processing script. We were going to expose the Python script to Extension:Collection anyway, so why not leave Extension:Collection as is and let it talk to the python script, which does concatenation and post-processing? Does anyone see any issues with this suggestion?

bmansurov removed bmansurov as the assignee of this task.Aug 17 2017, 5:38 PM
bmansurov updated the task description. (Show Details)Aug 17 2017, 6:24 PM

Does anyone see any issues with this suggestion?

We haven't yet spoken about what's required to deploy a separate service. Node.js is attractive in this context as we have direct contact with folk who've deployed such services thus far and the Services team providing support; we get some instrumentation for free, for example. While building out this functionality in PHP in an extension might not be as architecturally clean, we get a lot of support for free at the manageable cost of not being in direct control of the deploy cycle.

I think T171965: [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection might cover this but I think the answers are relevant here.

"We were going to expose the Python script to Extension:Collection anyway, so why not leave Extension:Collection as is and let it talk to the python script, which does concatenation and post-processing".

I noticed work began on this which I wasn't expecting.
Could we pause and talk about this? I didn't know we were using python in the collection extension for example .

I'd appreciate the team having a presentation/ overview of the work and decisions made so far before actually beginning work given the entire team is on the hook for delivering this. I had thought (hoped?) that was the plan with the spike work.

Maybe now is a good time for all the team (or at least Piotr and myself) to get involved and help deliver this...

@Jdlrobson ,

Sure, I'll put the work on hold. We can talk about this during stand-up. Although an overview/presentation maybe useful for the rest of the team, I encourage you to take a look at the related tasks. We've done a good job of making them clear for outsiders.

As a side-note, T171838: Build out article concatenation according to requirements for books was next in line and we were going to bring the task into the sprint (as we discussed earlier in a meeting). We also talked about this task in another meeting where the team awarded tokens to approve the decision made. That was a signal for me to start working on the task. Also, as a team we haven't been giving enough attention to the OCG replacement tasks, and there are quarterly goals that we need to meet. This fact also urged me to be bold and step up the work.

Do you have any specific concerns about the choice of Python? Or any other aspects of the development?

@Jdlrobson @pmiazga I've created a document to give a rough idea of the things we did so far and of the things that we plan on doing in: https://www.mediawiki.org/wiki/User:Bmansurov_(WMF)/Alternative_way_of_generating_PDF_books

Please take a look and let's continue the conversation in the talk page or the document itself. Feel free to leave your questions on either place and I'll be happy to answer them.

Tgr added a comment.EditedAug 23 2017, 2:31 AM

Re: what framework to use for concatenation and/or post-processing:

  • Currently we have MediaWiki, Node services and Python services on our cluster. Adding a new kind of thing (some kind of standalone PHP service) shouldn't be taken lightly IMO. Following the well-trodden path reduces maintenance overhead. (Plus, PDF post-processing can't be done in PHP anyway so what's the point in writing a PHP microservice to run non-PHP core logic?)
  • Given that this project is under severe time pressure, doing the work inside the PHP extension seems by far the best immediate approach to me:
    • The code for concatenation in PHP exists already
    • No overhead of setting up and managing a service
    • Unblocks Ops to get rid of the ocg boxes / RelEng to get rid of trebuchet / Services to update Node. The concatenation and/or PDF modification logic can always be moved into a service later.
    • Can still be exposed as an API if we want clients to be able to build their own UI for it. On one hand it's slightly more convenient because the API has access to the session (where collection data is currently stored), on the other hand using the action API to deliver files will be awkward. Neither of those are big issues though.
  • One Node-based scenario that would make a lot of sense is if we found a good Node library for PDF processing, then we could do everything in Javascript and reduce the number of languages involved in PDF generation. We already have a working solution though and researching PDF libraries is a nontrivial amount of time. (PDFKit at a glance cannot modify existing files so it is not useful here. HummusJS looks more promising.)
  • I agree the ability to output alternative formats (EPUB, Zim, whatever) is valuable. IMO that's a (weak) argument for doing as much in the extension as possible. We'll probably want a "RESTBase -> concatenated HTML + metadata -> format of choice" pipeline for those as well, but the metadata (and possibly the HTML) will probably need to be subtly different and it's not easy to predict how, so having it travel through more services makes life more difficult. Services are great when there is a simple, stable interface that can serve as a boundary between the different systems. That is the case with Electron (send HTML, get PDF back) but would be much less true for a HTML concatenation or PDF TOC generation service.

Re: language used for PDF post-processing:

  • It seems that pdflib can modify PDF metadata (not in-place, but it can copy PDF content with changes), and there is a PECL implementation, so a pure-PHP implementation using non-unmaintained packages might be possible. Inconvenient syntax + shady licensing situation though, probably not worth it.
  • I haven't seen any other PHP library that can alter PDF files and is in a decent state (Zend_PDF can do it, but it has been abandoned half a year ago in favor of TCPDF, which can't).
  • Python or Node would be a sensible choice if we find a good library (we did for Pythong already). One thing to consider is whether Node/Python is installed on the appservers. If not, we probably don't want to add it just because of this and might want to look into how well those languages support compiling to an executable.
  • I would keep the functionality minimal - fewer developers will be familiar with the language (especially if it's Python), and language boundaries are barriers for logging, debugging, refactoring etc. I don't see any advantage in moving HTML concatenation out of the extension (maybe in the long term if we go the Node service route, but not until then).

Nitpicks:

I'm also concerned about node being async and the fact that we might have to retrieve 200 articles. Sending all requests asynchronously at the same time might be a weak link.

You are not required to do everything at the same time just because it's Node. It's easy to write some kind of worker pool which does at most N requests in parallel. Async fetching seems like an advantage to me (although as you said, not a huge one given this is the fast part). It's not limited to Node though, MediaWiki has async HTTP support via MultiHttpClient (the POC patch uses it to fetch from RESTBase).

if not there is DOMDocument which does what it needs to do.

DOMDocument is not really reliable with HTML5 from what I heard.

Tgr added a comment.Aug 23 2017, 2:45 AM

Also, while the UI and code quality problems with Extension:Collection don't seem relevant to me (it's not as if it would go away if we do the concatenation elsewhere; the amount to interact with its existing codebase will be the same either way), using the ElectronPdfService extension is always an option. We are talking about functionality that needs to be wrapped around Electron to make it support multi-page documents, so it would make conceptual sense to put it into the extension dedicated to Electron.

Good points, @Tgr. I'll try to reply from the point of view of using Python and why it's a good choice (compared to PHP Extension and a node service).

  • Given that this project is under severe time pressure, doing the work inside the PHP extension seems by far the best immediate approach to me:
    • The code for concatenation in PHP exists already

I wouldn't call the code complete as it's mentioned in the commit message that it contains hacks and the code is a POC patch. Besides, it has a WIP label and hasn't gone through a review processes yet. That said, PHP codebase definitely has a head-start against Python codebase. If we are going to maintain the feature, we have to be careful and not throw code at an extension that hasn't been maintained for such a long time. It will definitely slow us down even in the short run.

  • No overhead of setting up and managing a service

Sure, same with Python if we bundle the code with Extension:Collection and call it from PHP (granted running Python code is supported, or we can bundle executables and distribute them with the extension).

  • Unblocks Ops to get rid of the ocg boxes / RelEng to get rid of trebuchet / Services to update Node. The concatenation and/or PDF modification logic can always be moved into a service later.

If we are going to move it to a service later, Python script is an ideal stand alone solution. The code will be self-containing, and as you mention above, we already have Python services in our infrastructure, so we won't be introducing anything new in terms of new technologies used when it comes to setting up a new service.

  • Can still be exposed as an API if we want clients to be able to build their own UI for it. On one hand it's slightly more convenient because the API has access to the session (where collection data is currently stored), on the other hand using the action API to deliver files will be awkward. Neither of those are big issues though.

I don't think we should be striving for adding more features to the current MediaWiki API when we can create a service that does the same job.

  • One Node-based scenario that would make a lot of sense is if we found a good Node library for PDF processing, then we could do everything in Javascript and reduce the number of languages involved in PDF generation. We already have a working solution though and researching PDF libraries is a nontrivial amount of time. (PDFKit at a glance cannot modify existing files so it is not useful here. HummusJS looks more promising.)

I remember us dismissing wkhtmltopdf because it was written in C++ and how it was a security risk. HummusJS also contains C++ code and we can bring up the same arguments here. It will take a long time to get it reviewed too, I imagine.

  • I agree the ability to output alternative formats (EPUB, Zim, whatever) is valuable. IMO that's a (weak) argument for doing as much in the extension as possible. We'll probably want a "RESTBase -> concatenated HTML + metadata -> format of choice" pipeline for those as well, but the metadata (and possibly the HTML) will probably need to be subtly different and it's not easy to predict how, so having it travel through more services makes life more difficult. Services are great when there is a simple, stable interface that can serve as a boundary between the different systems. That is the case with Electron (send HTML, get PDF back) but would be much less true for a HTML concatenation or PDF TOC generation service.

I'm not sure I understand why HTML concatenation and PDF post-processing cannot be thought of as similar services to Electron? As I'm currently proposing, we bundle these features into a single Python package, but in the feature, we can definitely move appropriate bits to two different services that do one thing only: concatenation or post-processing.

  • Python or Node would be a sensible choice if we find a good library (we did for Pythong already). One thing to consider is whether Node/Python is installed on the appservers. If not, we probably don't want to add it just because of this and might want to look into how well those languages support compiling to an executable.

Good point.

  • I would keep the functionality minimal - fewer developers will be familiar with the language (especially if it's Python), and language boundaries are barriers for logging, debugging, refactoring etc. I don't see any advantage in moving HTML concatenation out of the extension (maybe in the long term if we go the Node service route, but not until then).

I think we already have a handful of people who write Python on our team. Also, we'll need to write the PDF post-processing part in Python anyway as things stand. If HTML concatenation is already done in the extension, I'm fine leaving it there, but from my understanding, it's still a WIP. How much work do you think is left to get it done? On the other hand if we can get this functionality done as a standalone script with not much effort now, why not do it now? We'll be saving our feature selves a lot of new work.

Tgr added a comment.Aug 24 2017, 12:35 AM

I wouldn't call the code complete as it's mentioned in the commit message that it contains hacks and the code is a POC patch.

The commit message also mentions that the the concatenation code (BookRenderer/RemexCollectionMunger) can be used as-is. The logic for fetching data from Parsoid and MW core (DataProvider) is also easily fixable; I just hardcoded a bunch of URLs because I could not get Electron and RESTBase to work together in Vagrant. The rest of the patch is not really useful, but those already cover the functionality that could be moved to Python.

If we are going to maintain the feature, we have to be careful and not throw code at an extension that hasn't been maintained for such a long time.

I don't see why not. Collection will remain the interface to PDF rendering for the foreseeable future. Some parts of it will need to be touched anyway to integrate the new functionality; the rest doesn't need to be touched either way. But if you really want to use a well-maintained extension, ElectronPdfService is also available (it's currently maintained by WMDE but I imagine they wouldn't be saddened if Readers wanted to take over).

If we are going to move it to a service later, Python script is an ideal stand alone solution. The code will be self-containing, and as you mention above, we already have Python services in our infrastructure, so we won't be introducing anything new in terms of new technologies used when it comes to setting up a new service.

If we plan to run a Python service then it certainly makes sense. This would be the first time an Audiences team would maintain a Python service though, and being familiar with Python and being familiar with running Python web services are two very different things. So, not a decision to make lightly.

I don't think we should be striving for adding more features to the current MediaWiki API when we can create a service that does the same job.

Servicifying things that don't really need to be services is costly. The service will have to be maintained, deploying changes that affect both the service and the extension need to be coordinated, debugging becomes a lot more complicated. Also the extension becomes harder to use for third parties.

I remember us dismissing wkhtmltopdf because it was written in C++ and how it was a security risk. HummusJS also contains C++ code and we can bring up the same arguments here.

Fair point, I didn't notice that. I couldn't find other promising Node libraries with a cursory search.

I'm not sure I understand why HTML concatenation and PDF post-processing cannot be thought of as similar services to Electron?

Electron expects an URL or HTML body and returns a PDF file. That's a very simple and natural interface that's unlikely to change. The proposed service would have to receive a list of RESTBase URLs, title/chapter metadata, author metadata, section metadata... when your interface gets that complex, IMO that's a sign that you don't really need a separate service.

AFAIK we decided to do this in PHP during yesterday's kickoff meeting.

@Tgr is assigned to T171838: Build out article concatenation according to requirements for books to reflect that he's tidying up his WIP change to Extension:Collection, which @bmansurov, @Jdlrobson, and @pmiazga will all help review.

We decided to do this way due to a mix of looming deadline and the reasons laid out in T171964#3547535:

The commit message also mentions that the concatenation code (BookRenderer/RemexCollectionMunger) can be used as-is. The logic for fetching data from Parsoid and MW core (DataProvider) is also easily fixable; I just hardcoded a bunch of URLs because I could not get Electron and RESTBase to work together in Vagrant. The rest of the patch is not really useful, but those already cover the functionality that could be moved to Python.

i.e. that the WIP change isn't actually that far from not being a WIP, e.g. hardcoded URLs need to be moved to configuration variables.

phuedx closed this task as Resolved.Aug 25 2017, 1:26 PM
phuedx claimed this task.
phuedx updated the task description. (Show Details)
phuedx updated the task description. (Show Details)

If I've missed a detail, then please feel free to comment and add a link to the comment in the description for posterity. Regardless, this is Done 🎉🎉🎉