
[Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection
Closed, ResolvedPublic

Description

In T171960 we're creating a script that adds page numbers and a table of contents to a PDF file. The script will be written in Python. Once the script is ready, how should we expose it for use by other services/extensions? The goal is to take a PDF generated by ElectronRenderService, post-process it, and provide the output to the requesting service.

Should we:

  • Use a WSGI-compliant Python web framework such as Falcon or Flask; or
  • Create a package and make it available via PyPI; or
  • Bundle it as a dependency of Extension:Collection and execute the scripts from PHP?

Outcomes

  • Decide on how to expose the script above.
  • We document how we exposed the library for future projects.


Event Timeline

bmansurov renamed this task from Expose PDF post-processing library to Expose HTML concatenation and PDF post-processing scripts.Aug 18 2017, 3:58 PM
bmansurov updated the task description. (Show Details)
bmansurov renamed this task from Expose HTML concatenation and PDF post-processing scripts to [Spike] Expose HTML concatenation and PDF post-processing scripts.Aug 18 2017, 4:14 PM
bmansurov renamed this task from [Spike] Expose HTML concatenation and PDF post-processing scripts to [Spike] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection.

@bmansurov: Before starting T171838: Build out article concatenation according to requirements for books, would it be worth investigating this so that it can be folded into your decision to try to build out the concatenation in Python?

@phuedx, we can investigate this in parallel. I don't think the outcome of this spike affects the implementation of T171838: Build out article concatenation according to requirements for books as we can choose either option (suggested in the description) and not worry about changing the implementation of T171838.

bmansurov renamed this task from [Spike] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection to [Spike - 8 hours] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection.Aug 23 2017, 5:44 PM

Here are some pros and cons from a developer's point of view. To choose the best approach (which may not be among the identified approaches), we need some domain expertise from SRE, Release-Engineering-Team, and Services. I wonder who the best people to talk to here are (cc @phuedx).

A standalone service

Why do it this way

  • Expertise. The foundation has projects like Wikimetrics that are built this way. Wikimetrics runs on Flask.
  • Scalability. Depending on the load, more virtual server instances can be spun up. Someone more familiar with this should clarify.
  • Security. Since the Python scripts won't run in the same environment as MediaWiki, their impact on other parts of the system will be limited to the instances they are running on. This is not a huge advantage because our code and third-party libraries will go through a security review first. However, it's still desirable to know that bugs in software don't cause system-wide harm.
  • ?

Why not do it this way

  • Overhead of creating and maintaining a new service.
  • ?

A Python package

Why do it this way

  • Availability. By default Python is available on our servers (which are either Ubuntu or Debian). Needs clarification on whether MediaWiki extensions are allowed to run Python scripts from PHP or whether the Python executable has been masked.
  • Simplicity. No need to configure an external service.
  • ?

Why not do it this way

  • Scalability. Depending on the size of the book being generated, concatenation and PDF post-processing can be resource-intensive processes. This may affect the MediaWiki environment that the scripts are running in.
  • ?
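If we went the package route, the natural shape is a small CLI wrapped around the library, which PHP could invoke as a subprocess. A rough sketch, where the program name and flags are hypothetical, not a real interface:

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI for the packaged post-processing script, callable
    from PHP via something like `python -m pdf_postprocess in.pdf out.pdf`."""
    parser = argparse.ArgumentParser(prog="pdf-postprocess")
    parser.add_argument("input", help="path to the PDF produced by Electron")
    parser.add_argument("output", help="path to write the post-processed PDF")
    parser.add_argument("--toc-depth", type=int, default=2,
                        help="heading depth to include in the table of contents")
    return parser.parse_args(argv)
```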

Executable files

Why do it this way

  • Availability. We can use something like PyInstaller to generate stand-alone executables and call these executables from PHP.
  • ?

Why not do it this way

  • Complexity. We'll be adding one more step (in addition to the actual development of the codebase) - the executable-generation step - to achieve the same result. If a Python interpreter is already available on the system, there is no need to go this route; we may as well call python some_script.py.
  • ?

More info

Please chime in with your suggestions/improvements.

@faidon, @greg: We're in need of your guidance here: we're confident that Python is the right choice, as there are no good libraries for manipulating PDFs in PHP (see T171964#3516675 onwards). However, we're not quite sure how best to use a Python script from a PHP extension.

Can our application servers run Python? Is there prior art for shelling out to a Python script on our application servers?

Also, sorry if you're not the best folk to ping about this. If that's the case then who is?

I don't know of any prior art of shelling out to a Python script on the app servers. But I do know that we have Python services in production (like ORES and Striker).

phuedx renamed this task from [Spike - 8 hours] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection to [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection.Aug 25 2017, 11:19 AM
phuedx updated the task description. (Show Details)

@faidon - not sure if you had a chance to look at this yet. Sorry to double-ping you, but given our timeline, there's a bit of urgency around this.

Honestly... I'm not exactly sure what you're proposing :) Is there a design document or something that describes the architecture of the system you're thinking of implementing?

Take a look here [1]. The document was written before we decided to do the concatenation part in PHP, but other than that it should accurately describe what we're trying to do.

[1] https://www.mediawiki.org/wiki/User:Bmansurov_(WMF)/Alternative_way_of_generating_PDF_books

Hi, I took a look at your current proposal and I see a series of issues with it. I might still not have understood what you're proposing fully, if that's the case, please let me know!

So, let's start with the least preferable option in my opinion:

Shelling out from MediaWiki is possible, but it's hardly recommended

There are several reasons for this:

  1. It goes in the opposite direction of a microservices-based architecture.
  2. It poses security problems (you'll be handling PDFs on the same servers that hold the security credentials for all our databases).
  3. It would be a wonderful DoS vector against MediaWiki (PDF processing for large books takes time, and shelling out is synchronous).
  4. Shellouts from MediaWiki have several memory and file-size limitations, for good reason. We won't be raising them for this specific case, and this can easily hit both of those limits.

So I strongly oppose the idea of creating a python script and shelling out to it.

A separate service is not ideal either

The idea is to have MediaWiki call this new collections service X, which in turn calls PDFRender. This creates a chain of calls that adds possible points of failure and makes debugging more complicated. In particular, I'd like to understand why you can't implement this functionality inside the existing service, or (alternatively) why we want to use PDFRender for this at all, if it's written in a language that doesn't have good PDF-mangling libraries.

Still, I consider this option much better than shelling out

But even in this case, the architectural requirement stands to make requesting and fetching the collection asynchronous from MediaWiki (as it is today), in order not to have to fight with timeouts or - worse - a denial-of-service vector.

The ideal solution (from an architectural perspective)

A self contained service that:

  • serves single pages directly (synchronously)
  • serves collections asynchronously if they take more than X seconds to compute (this might require at least some ephemeral storage service)

This could be done (for example) by adding another endpoint to the current PDFRender service, where you can maybe shell out to Python if needed.

This is a very high level overview of guidelines on what could/should be acceptable. I think more feedback/discussion about how this part of the service should be fit inside our overall architecture is needed.

@Joe thanks for the input. Just to be clear, there is no single proposal. We are looking at pros and cons of each approach.

It was not clear to me what you're referring to when you say PDFRender, but I suppose you mean Extension:ElectronPdfService? If I understand you correctly, from the ops perspective the best place to do PDF post-processing is via ElectronPdfService. Am I right?

Also, going through the remainder of the design document and the implementation PoC, I could summarize the flow as follows:

  1. MediaWiki requests the pages' HTML from RESTBase
    1. RESTBase will call Parsoid to get that HTML if not cached
    2. Parsoid will call the MediaWiki API to render that HTML
  2. Something (not clear to me if PHP or a Python script) joins this HTML together and sends it to pdfrender
  3. pdfrender generates the PDF
  4. Some post-processing script adds a table of contents and page numbers

Can you confirm that this is the request flow you had in mind?

I see several issues with that request flow, but I'd like to confirm my understanding of it is sound.

  1. MediaWiki requests the pages' HTML from RESTBase
    1. RESTBase will call Parsoid to get that HTML if not cached
    2. Parsoid will call the MediaWiki API to render that HTML

I'm not sure if we're calling the MW API or RESTBase. @Tgr could you clarify?

  1. Something (not clear to me if PHP or a Python script) joins this HTML together and sends it to pdfrender

Extension:Collection will do it.

  1. pdfrender generates the PDF

I think pdfrender is also a RESTBase endpoint. @Tgr is that correct? Or are we talking to Extension:ElectronPdfRender?

  1. Some post-processing script adds a table of contents and page numbers

Yes, this is a Python script that we're building.

This is the control flow as proposed now:

  1. MediaWiki requests HTML of pages from RESTBase
    • RESTBase might fall back to Parsoid if not cached; Parsoid partially relies on the PHP API for rendering
  2. MediaWiki concatenates the pages into a single document ('concatenate' is a bit misleading; this requires parsing the HTML)
  3. MediaWiki POSTs the HTML to Electron (mediawiki/services/electron-render; RESTBase and ElectronPdfRender cannot be used as they do not expose rendering of arbitrary HTML, for obvious reasons)
  4. Electron responds with a PDF file
  5. MediaWiki shells out to Python which does some post-processing on the PDF file
  6. MediaWiki returns the PDF to the user.

It's all rather roundabout. To a large extent that's inherent in the problem: the user needs to communicate with the MediaWiki UI, the post-processing needs to be done in Python (seemingly the only language with decent PDF editing support), and Electron is a Node app (in theory we could use some Python-based way of remote-controlling Chrome, such as prerender, but that seems like a big project for relatively little benefit). So we could, for example, have a Python service instead and go MediaWiki -> web redirect to Python service -> RESTBase (-> Parsoid -> MW) -> Python service -> Electron -> Python service, but it wouldn't be much of an improvement.

The (not necessarily permanent) choice of shelling out was made mainly based on time constraints: Ops wants to get rid of OCG very soon, and I doubt writing and deploying a new Python webservice would be a matter of days. Most of the pieces for the above arrangement are already in place, and we use firejail for limiting the security impact of some shell commands which presumably can be reused here (the impact seems relatively limited anyway; code execution vulnerabilities are rare in dynamic languages).

As for DoS, timeouts, etc., the assumption was that rendering a PDF from an HTML page is much more expensive than the needed pre- and post-processing. It involves building a DOM tree, then computing the styles and merging into a render tree, then laying out, then painting, then serializing to PDF... On the other hand, Remex is a streaming HTML processor which makes a single pass through the document and can discard elements from memory once it encounters the close tag (so the memory use will be proportional to tree depth, not tree size), and PDF post-processing mostly involves PDF metadata and does not require parsing most of the content at all. (Presumably. I have no idea how pdfrw is actually implemented.) So anything that kills the PDF or Python parts would kill the Node part anyway.
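The streaming argument can be illustrated with Python's stdlib HTML parser (Remex is PHP, but the principle is the same): a single pass only ever holds the stack of currently open elements, so memory grows with tree depth, not document size.

```python
from html.parser import HTMLParser

class DepthTracker(HTMLParser):
    """Streaming pass over a document: the only state kept is the stack of
    open elements, so memory is proportional to tree depth, not tree size."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

parser = DepthTracker()
parser.feed("<html><body><div><p>one</p><p>two</p></div></body></html>")
```

However large the document, the peak state here is the deepest open-element stack, which is the property Tgr is relying on.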

At the Reading/Services sync a standalone Python service seemed to be the consensus option. It's superior from a security and scalability POV, but it would also expand the timeline of the project a bit, I imagine.
That raises the question: Services is considering replacing Electron with headless Chrome (T172815) and the latter is easily available in Python as well, so should that Python service become the replacement for Electron? @GWicke any thoughts?

(Re: the option of a Python package in T171965#3547239, I think that's orthogonal. Whether we end up with a Python script or web service or precompiled executable, the nice approach is definitely to separate the actual postprocessing logic into a package.)

@Tgr, at first sight it looks like there are reasonable python bindings for headless Chrome as well. Combined with the PDF post-processing library you have been testing, I could see a simple python service doing both pre/postprocessing and actual rendering work well. The service portion of either option is trivial in any case, and all the heavy lifting is in the libraries & Chrome.

This service would be locked down inside firejail, ideally without write access to the local filesystem, and can be restricted at the network level to only allow access public resources.

This is the control flow as proposed now:

  1. MediaWiki requests HTML of pages from RESTBase
    • RESTBase might fall back to Parsoid if not cached; Parsoid partially relies on the PHP API for rendering
  2. MediaWiki concatenates the pages into a single document ('concatenate' is a bit misleading; this requires parsing the HTML)
  3. MediaWiki POSTs the HTML to Electron (mediawiki/services/electron-render; RESTBase and ElectronPdfRender cannot be used as they do not expose rendering of arbitrary HTML, for obvious reasons)
  4. Electron responds with a PDF file
  5. MediaWiki shells out to Python which does some post-processing on the PDF file
  6. MediaWiki returns the PDF to the user.

Thanks for the thorough explanation.

At the Reading/Services sync a standalone Python service seemed to be the consensus option. It's superior from a security and scalability POV, but it would also expand the timeline of the project a bit, I imagine.
That raises the question: Services is considering replacing Electron with headless Chrome (T172815) and the latter is easily available in Python as well, so should that Python service become the replacement for Electron? @GWicke any thoughts?

(Re: the option of a Python package in T171965#3547239, I think that's orthogonal. Whether we end up with a Python script or web service or precompiled executable, the nice approach is definitely to separate the actual postprocessing logic into a package.)

If we decide to develop a service, I'd go the same way we went with ORES: creating a repository with wheels to deploy to a virtualenv, then serving the application via uwsgi.
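For illustration, the uwsgi end of that ORES-style deployment might be configured roughly like this (all paths, module names, and values here are hypothetical, not the actual production setup):

```ini
; hypothetical uwsgi.ini for the post-processing service, following the
; wheels-into-virtualenv deployment pattern described above
[uwsgi]
virtualenv = /srv/deployment/pdf-postprocess/venv
module = pdf_postprocess.wsgi:application
http-socket = :8080
processes = 4
```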

This service would replace the current electron pdf renderer as well on the medium/long run, right?

This service would replace the current electron pdf renderer as well on the medium/long run, right?

Yes, and in doing so would resolve T172815 as well.

Thanks everyone for comments. Looks like the decision is to do this in a separate Python service.

AFAICT all technical stakeholders (Ops, RelEng, and Services) are satisfied that building the post-processing step as a stateless web service is the correct way to proceed.

If we decide to develop a service, I'd go the same way we went with ORES: creating a repository with wheels to deploy to a virtualenv, then serving the application via uwsgi.

Any links to documentation about that process would be appreciated!

This service would replace the current electron pdf renderer as well on the medium/long run, right?

Readers Web/Infra and Services have yet to talk this through.