
[Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection
Closed, ResolvedPublic

Description

In T171960 we're creating a script that adds page numbers and a table of contents to a PDF file. The script will be written in Python. Once the script is ready, how should we expose it for use by other services/extensions? The goal is to take a PDF generated by ElectronRenderService, post-process it, and provide the output to the requesting service.

Should we:

  • Use a WSGI-compliant Python web framework such as Falcon or Flask; or
  • Create a package and make it available via PyPI; or
  • Bundle it as a dependency of Extension:Collection and execute the scripts from PHP?

Outcomes

  • Decide on how to expose the script above.
  • We document how we exposed the library for future projects.


Event Timeline

bmansurov renamed this task from Expose PDF post-processing library to Expose HTML concatenation and PDF post-processing scripts.Aug 18 2017, 3:58 PM
bmansurov updated the task description. (Show Details)
bmansurov renamed this task from Expose HTML concatenation and PDF post-processing scripts to [Spike] Expose HTML concatenation and PDF post-processing scripts.Aug 18 2017, 4:14 PM
bmansurov renamed this task from [Spike] Expose HTML concatenation and PDF post-processing scripts to [Spike] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection.

@bmansurov: Before starting T171838: Build out article concatenation according to requirements for books, would it be worth investigating this so that it can be folded into your decision to try to build out the concatenation in Python?

@phuedx, we can investigate this in parallel. I don't think the outcome of this spike affects the implementation of T171838: Build out article concatenation according to requirements for books as we can choose either option (suggested in the description) and not worry about changing the implementation of T171838.

bmansurov renamed this task from [Spike] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection to [Spike - 8 hours] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection.Aug 23 2017, 5:44 PM

Here are some pros and cons from a developer's point of view. To choose the best approach (which may not be among the identified approaches), we need some domain expertise from SRE, Release-Engineering-Team, and Services. I wonder who the best people to talk to here are (cc @phuedx).

A standalone service

Why do it this way

  • Expertise. The foundation has projects like Wikimetrics that are built this way. Wikimetrics runs on Flask.
  • Scalability. Depending on the load, more virtual server instances can be spun up. Someone more familiar with this should clarify.
  • Security. Since the Python scripts won't run in the same environment as MediaWiki, their impact on other parts of the system will be limited to the instances they are running on. This is not a huge advantage because our code and third-party libraries will go through a security review first. However, it's still desirable to know that bugs in software don't cause system-wide harm.
  • ?

Why not do it this way

  • Overhead of creating and maintaining a new service.
  • ?

A Python package

Why do it this way

  • Availability. By default Python is available on our servers (which are either Ubuntu or Debian). Needs clarification on whether MediaWiki extensions are allowed to run Python scripts from PHP or whether the Python executable has been masked.
  • Simplicity. No need to configure an external service.
  • ?

Why not do it this way

  • Scalability. Depending on the size of the book being generated, concatenation and PDF post-processing can be resource-intensive processes. This may affect the MediaWiki environment that the scripts are running in.
  • ?
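If we went the package route, the natural shape is a small CLI wrapped around the library, which PHP could invoke as a subprocess. A rough sketch, where the program name and flags are hypothetical, not a real interface:

```python
import argparse

def parse_args(argv=None):
    """Hypothetical CLI for the packaged post-processing script, callable
    from PHP via something like `python -m pdf_postprocess in.pdf out.pdf`."""
    parser = argparse.ArgumentParser(prog="pdf-postprocess")
    parser.add_argument("input", help="path to the PDF produced by Electron")
    parser.add_argument("output", help="path to write the post-processed PDF")
    parser.add_argument("--toc-depth", type=int, default=2,
                        help="heading depth to include in the table of contents")
    return parser.parse_args(argv)
```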

Executable files

Why do it this way

  • Availability. We can use something like PyInstaller to generate stand-alone executables and call these executables from PHP.
  • ?

Why not do it this way

  • Complexity. We'll be adding one more step (in addition to the actual development of the codebase) - the executable-generation step - to achieve the same result. If a Python interpreter is already available on the system, there is no need to go this route; we may as well call python some_script.py.
  • ?

More info

Please chime in with your suggestions/improvements.

@faidon, @greg: We're in need of your guidance here: we're confident that Python is the right choice, as there are no good libraries for manipulating PDFs in PHP (see T171964#3516675 onwards). However, we're not quite sure how best to use a Python script from a PHP extension.

Can our application servers run Python? Is there prior art for shelling out to a Python script on our application servers?

Also, sorry if you're not the best folk to ping about this. If that's the case then who is?

I don't know of any prior art of shelling out to a Python script on the app servers. But I do know that we have Python services in production (like ORES and Striker).

phuedx renamed this task from [Spike - 8 hours] How should the HTML concatenation and PDF post-processing scripts be exposed for use by Extension:Collection to [Spike - 8 hours] How should the PDF post-processing script be exposed for use by Extension:Collection.Aug 25 2017, 11:19 AM
phuedx updated the task description. (Show Details)

@faidon - not sure if you had a chance to look at this yet. Sorry to double-ping you, but given our timeline, there's a bit of urgency around this.

Honestly... I'm not exactly sure what you're proposing :) Is there a design document or something that describes the architecture of the system you're thinking of implementing?

Take a look here [1]. The document was written before we decided to do the concatenation part in PHP, but other than that it should accurately describe what we're trying to do.

[1] https://www.mediawiki.org/wiki/User:Bmansurov_(WMF)/Alternative_way_of_generating_PDF_books

Hi, I took a look at your current proposal and I see a series of issues with it. I might still not have understood what you're proposing fully, if that's the case, please let me know!

So, let's start with the least preferable option in my opinion:

Shelling out from MediaWiki is possible, but it's hardly recommended

There are several reasons for this:

  1. It goes in the opposite direction of a microservices-based architecture.
  2. It poses security problems (you'll be handling PDFs on the same servers that hold the security credentials for all our databases).
  3. It would be a wonderful DoS vector against MediaWiki (PDF processing for large books takes time, and shelling out is synchronous).
  4. Shellouts from MediaWiki have several memory and file-size limitations, for good reason. We won't be raising them for this specific case, and this can easily hit both of those limits.

So I strongly oppose the idea of creating a python script and shelling out to it.

A separate service is not ideal either

The idea is to have MediaWiki call this new collections service X, which in turn calls PDFRender. This creates a chain of calls that adds possible points of failure and makes debugging more complicated. In particular, I'd like to understand why you can't implement this functionality inside the existing service, or (alternatively) why we want to use PDFRender for this at all, if it's written in a language that doesn't have good PDF-mangling libraries.

Still, I consider this option much better than shelling out

But even in this case, the architectural requirement stands to make requesting and fetching the collection asynchronous from MediaWiki (as it is today), in order not to have to fight with timeouts or - worse - a denial-of-service vector.

The ideal solution (from an architectural perspective)

A self contained service that:

  • serves single pages directly (synchronously)
  • serves collections asynchronously if they take more than X seconds to compute (this might require at least some ephemeral storage service)

This could be done (for example) by adding another endpoint to the current PDFRender service, where you can maybe shell out to Python if needed.

This is a very high level overview of guidelines on what could/should be acceptable. I think more feedback/discussion about how this part of the service should be fit inside our overall architecture is needed.

@Joe thanks for the input. Just to be clear, there is no single proposal. We are looking at pros and cons of each approach.

It was not clear to me what you're referring to when you say PDFRender, but I suppose you mean Extension:ElectronPdfService? If I understand you correctly, from the ops perspective the best place to do PDF post-processing is via ElectronPdfService. Am I right?

Also, going through the remainder of the design document and the implementation PoC, I could summarize the flow as follows:

  1. MediaWiki requests the pages' HTML from RESTBase
    1. RESTBase will call Parsoid to get that HTML if not cached
    2. Parsoid will call the MediaWiki API to render that HTML
  2. Something (not clear to me if PHP or a Python script) joins this HTML together and sends it to pdfrender
  3. pdfrender generates the PDF
  4. Some post-processing script adds a table of contents and page numbers

Can you confirm that this is the request flow you had in mind?

I see several issues with that request flow, but I'd like to confirm my understanding of it is sound.

  1. MediaWiki requests the pages' HTML from RESTBase
    1. RESTBase will call Parsoid to get that HTML if not cached
    2. Parsoid will call the MediaWiki API to render that HTML

I'm not sure if we're calling the MW API or RESTBase. @Tgr could you clarify?

  1. Something (not clear to me if PHP or a Python script) joins this HTML together and sends it to pdfrender

Extension:Collection will do it.

  1. pdfrender generates the PDF

I think pdfrender is also a RESTBase endpoint. @Tgr is that correct? Or are we talking to Extension:ElectronPdfRender?

  1. Some post-processing script adds a table of contents and page numbers

Yes, this is a Python script that we're building.

This is the control flow as proposed now:

  1. MediaWiki requests HTML of pages from RESTBase
    • RESTBase might fall back to Parsoid if not cached; Parsoid partially relies on the PHP API for rendering
  2. MediaWiki concatenates the pages into a single document ('concatenate' is a bit misleading; this requires parsing the HTML)
  3. MediaWiki POSTs the HTML to Electron (mediawiki/services/electron-render; RESTBase and ElectronPdfRender cannot be used as they do not expose rendering of arbitrary HTML, for obvious reasons)
  4. Electron responds with a PDF file
  5. MediaWiki shells out to Python which does some post-processing on the PDF file
  6. MediaWiki returns the PDF to the user.

It's all rather roundabout. To a large extent that's inherent in the problem: the user needs to communicate with the MediaWiki UI, the post-processing needs to be done in Python (seemingly the only language with decent PDF editing support), and Electron is a Node app (in theory we could use some Python-based way of remote-controlling Chrome, such as prerender, but that seems like a big project for relatively little benefit). So we could, for example, have a Python service instead and go MediaWiki -> web redirect to Python service -> RESTBase (-> Parsoid -> MW) -> Python service -> Electron -> Python service, but it wouldn't be much of an improvement.

The (not necessarily permanent) choice of shelling out was made mainly based on time constraints: Ops wants to get rid of OCG very soon, and I doubt writing and deploying a new Python webservice would be a matter of days. Most of the pieces for the above arrangement are already in place, and we use firejail for limiting the security impact of some shell commands which presumably can be reused here (the impact seems relatively limited anyway; code execution vulnerabilities are rare in dynamic languages).

As for DoS, timeouts, etc., the assumption was that rendering a PDF from an HTML page is much more expensive than the needed pre- and post-processing. It involves building a DOM tree, then computing the styles and merging into a render tree, then laying out, then painting, then serializing to PDF... On the other hand, Remex is a streaming HTML processor which makes a single pass through the document and can discard elements from memory once it encounters the close tag (so the memory use will be proportional to tree depth, not tree size), and PDF post-processing mostly involves PDF metadata and does not require parsing most of the content at all. (Presumably. I have no idea how pdfrw is actually implemented.) So anything that kills the PDF or Python parts would kill the Node part anyway.
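The streaming argument can be illustrated with Python's stdlib HTML parser (Remex is PHP, but the principle is the same): a single pass only ever holds the stack of currently open elements, so memory grows with tree depth, not document size.

```python
from html.parser import HTMLParser

class DepthTracker(HTMLParser):
    """Streaming pass over a document: the only state kept is the stack of
    open elements, so memory is proportional to tree depth, not tree size."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no close

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

parser = DepthTracker()
parser.feed("<html><body><div><p>one</p><p>two</p></div></body></html>")
```

However large the document, the peak state here is the deepest open-element stack, which is the property Tgr is relying on.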

At the Reading/Services sync a standalone Python service seemed to be the consensus option. It's superior from a security and scalability POV, but it would also expand the timeline of the project a bit, I imagine.
That raises the question: Services is considering replacing Electron with headless Chrome (T172815) and the latter is easily available in Python as well, so should that Python service become the replacement for Electron? @GWicke any thoughts?

(Re: the option of a Python package in T171965#3547239, I think that's orthogonal. Whether we end up with a Python script or web service or precompiled executable, the nice approach is definitely to separate the actual postprocessing logic into a package.)

@Tgr, at first sight it looks like there are reasonable python bindings for headless Chrome as well. Combined with the PDF post-processing library you have been testing, I could see a simple python service doing both pre/postprocessing and actual rendering work well. The service portion of either option is trivial in any case, and all the heavy lifting is in the libraries & Chrome.

This service would be locked down inside firejail, ideally without write access to the local filesystem, and can be restricted at the network level to only allow access public resources.

This is the control flow as proposed now:

  1. MediaWiki requests HTML of pages from RESTBase
    • RESTBase might fall back to Parsoid if not cached; Parsoid partially relies on the PHP API for rendering
  2. MediaWiki concatenates the pages into a single document ('concatenate' is a bit misleading; this requires parsing the HTML)
  3. MediaWiki POSTs the HTML to Electron (mediawiki/services/electron-render; RESTBase and ElectronPdfRender cannot be used as they do not expose rendering of arbitrary HTML, for obvious reasons)
  4. Electron responds with a PDF file
  5. MediaWiki shells out to Python which does some post-processing on the PDF file
  6. MediaWiki returns the PDF to the user.

Thanks for the thorough explanation.

At the Reading/Services sync a standalone Python service seemed to be the consensus option. It's superior from a security and scalability POV, but it would also expand the timeline of the project a bit, I imagine.
That raises the question: Services is considering replacing Electron with headless Chrome (T172815) and the latter is easily available in Python as well, so should that Python service become the replacement for Electron? @GWicke any thoughts?

(Re: the option of a Python package in T171965#3547239, I think that's orthogonal. Whether we end up with a Python script or web service or precompiled executable, the nice approach is definitely to separate the actual postprocessing logic into a package.)

If we decide to develop a service, I'd go the same way we went with ORES: creating a repository with wheels to deploy to a virtualenv, then serving the application via uwsgi.
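For illustration, the uwsgi end of that ORES-style deployment might be configured roughly like this (all paths, module names, and values here are hypothetical, not the actual production setup):

```ini
; hypothetical uwsgi.ini for the post-processing service, following the
; wheels-into-virtualenv deployment pattern described above
[uwsgi]
virtualenv = /srv/deployment/pdf-postprocess/venv
module = pdf_postprocess.wsgi:application
http-socket = :8080
processes = 4
```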

This service would replace the current electron pdf renderer as well on the medium/long run, right?

This service would replace the current electron pdf renderer as well on the medium/long run, right?

Yes, and in doing so would resolve T172815 as well.

Thanks everyone for comments. Looks like the decision is to do this in a separate Python service.

AFAICT all technical stakeholders (Ops, RelEng, and Services) are satisfied that building the post-processing step as a stateless web service is the correct way to proceed.

If we decide to develop a service, I'd go the same way we went with ORES: creating a repository with wheels to deploy to a virtualenv, then serving the application via uwsgi.

Any links to documentation about that process would be appreciated!

This service would replace the current electron pdf renderer as well on the medium/long run, right?

Readers Web/Infra and Services have yet to talk this through.