
Error handling in Parsoid
Closed, ResolvedPublic

Description

Parsoid has a pattern of calling env.log('fatal', ...) (or 'fatal/request', which returns an HTTP 500) when it enters a bad state (i.e. after catching a thrown error). I'm not too keen on it and would like to see errors returned to callers, rather than having the logger magically kill the process. We also ran into a bug on an old version of Express where trying to write headers after content has already been sent gets into an infinite loop, which is why our API utils have all these responseSent flags. That happened when the logger sent a 500 mid-stream but the async processing of the page then finished and tried to send an (albeit undesirable) response.
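To make the responseSent flag concrete, here is a minimal sketch of the guard pattern described above (the names makeResponder/send are illustrative, not Parsoid's actual API utils): the first write wins, and any later async completion that tries to respond is silently dropped instead of re-writing headers.

```javascript
// Hypothetical sketch of the responseSent guard: once an error handler
// has sent a 500 mid-stream, a later async completion must not try to
// write headers again.
function makeResponder(res) {
  let responseSent = false;
  return function send(status, body) {
    if (responseSent) {
      // A second responder fired (e.g. the parse finished after the
      // logger already sent a 500); drop it.
      return false;
    }
    responseSent = true;
    res.writeHead(status, { 'Content-Type': 'text/plain' });
    res.end(body);
    return true;
  };
}
```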

I did some preliminary inspection of what it's going to take for a refactor.

  1. The token transformers need to emit errors and handle error events.

  2. The serializer has all these cbs which aren't really callbacks; they are accumulators. We should extend them to accumulate errors as well (as the first arg). Then when the serialization is done, return both.
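Item 2 could be sketched roughly as follows (makeAccumulator and cb.done are hypothetical names, not the serializer's real interface): the cb keeps Node-style error-first arguments, collects errors instead of logging fatally, and hands both the errors and the accumulated output back when serialization finishes.

```javascript
// Hypothetical sketch: an accumulator-style cb that collects errors
// (first arg, Node-style) alongside output chunks, so the caller gets
// both back instead of the logger killing the process.
function makeAccumulator() {
  const chunks = [];
  const errors = [];
  function cb(err, chunk) {
    if (err) { errors.push(err); }
    if (chunk !== undefined && chunk !== null) { chunks.push(chunk); }
  }
  // When serialization is done, return both accumulated errors and output.
  cb.done = () => ({ errors, output: chunks.join('') });
  return cb;
}
```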

Event Timeline

Arlolra claimed this task.
Arlolra raised the priority of this task from to Medium.
Arlolra updated the task description. (Show Details)
Arlolra added a project: Parsoid.
Arlolra subscribed.
Arlolra set Security to None.
Arlolra updated the task description. (Show Details)

@ssastry is -1 on me doing this until after vacation.

  1. The serializer has all these cbs which aren't really callbacks; they are accumulators. We should extend them to accumulate errors as well (as the first arg). Then when the serialization is done, return both.

Error handling in the serializer is much improved after T115720.

See https://gerrit.wikimedia.org/r/#/c/257944/8/lib/wt2html/tt/LinkHandler.js

This is the async phase, right? So throwing here will likely return the correct HTTP error to the client, but does nothing to stop Parsoid from continuing the parse.

We really need to improve error handling during wt2html, T110961.

This is why the cpu timeout doesn't always work. Sometimes the processing will yield enough for the request timeout to fire (and disable the cpu timeout) but that doesn't actually halt processing.

I think we determined a while back that the request timeout doesn't really do anything useful (because of the bad error handling in this phase) beyond masking the problem and getting in the way of the cputimeout (which is much harsher). Should we remove it entirely?

Another option worth considering is to use pure compute workers for the actual request processing:

  • Only run a small number of cluster workers handling HTTP requests.
  • Each cluster worker maintains a compute worker pool, and dispatches any actual parsing to those workers. Each compute worker processes only a single client request at a time.
  • Enforce memory and time limits on compute workers by killing them hard, and returning appropriate error HTTP responses from the cluster worker.

ASCII art:

service-runner master
  service-runner cluster worker
    parsoid compute worker
    parsoid compute worker
    ...
  service-runner cluster worker
    parsoid compute worker
    parsoid compute worker
    ...

To my knowledge, this is the most reliable way to limit CPU and memory resources per request, without affecting concurrent requests. An in-process alternative could be to use the VM module, but this only offers time limits, and will still break concurrent requests on OOM.

Definitely worth considering. I think I was suggesting something similar in T90668#1951210

OCG's ASCII art would look more like this, though:

service-runner master
  service-runner cluster http worker
  service-runner cluster http worker
  ...
  service-runner cluster compute worker
  service-runner cluster compute worker
  service-runner cluster compute worker
  ...

with, I guess, more orchestration between workers (it uses Redis).

Your proposal seems simpler on the face of it, which is good.

Arlolra raised the priority of this task from Medium to Medium.Jun 15 2017, 10:01 PM

And for,

  1. The token transformers need to emit errors and handle error events.

that will be done instead in T183327 so I'll close in favour of finishing there.

This issue is referenced here: https://github.com/wikimedia/parsoid/blob/master/lib/api/routes.js#L152

The process.exit is causing issues in MWOffliner; do we have an estimated timeline?

@Isnit001, we may never get to some of these issues.

Parsoid is currently being ported to PHP, and once that is verified correct and deployed, we will be retiring the JS version. There will be one more release of Parsoid JS later this year, which is likely to be its last.

See https://www.mediawiki.org/wiki/Tech_talks#Episode_1:_The_long_and_winding_road_to_making_Parsoid_the_default_mediaWiki_parser for where Parsoid is headed.

The Parsoid API itself will not change, but it will no longer be an independent service as it is now. We should separately discuss what this means for Kiwix at some point in the coming months.