Stress test Parsoid's HTTP API
Closed, ResolvedPublic

Description

In T204624 (and maybe T198421), we saw a failure scenario where, under a higher than expected request rate, Parsoid continued to fulfill some requests but did not respond with a 503 when its queues were presumably full, and instead caused socket timeouts on the client side.

That task was closed after tuning change prop for the beta cluster, but without addressing the issue in Parsoid.

Yesterday's production incident looked somewhat similar, so here we are again.

Event Timeline

Parsoid was in a bit of trouble again today. At 02:44, XioNoX depooled eqsin, so all the mobile traffic for the Chinese Wikipedia started hitting RESTBase. Since Chinese has language variants, RB started asking Parsoid to transform HTML into the correct variant. As the transformation request rate reached roughly 40 req/s, Parsoid started struggling, alerting and timing out.

I would like to prioritize this issue: even though node Parsoid is being replaced, depooling an edge CDN DC should not be able to bring down our infrastructure. Also, 40 req/s of lang variant transforms seems quite low to affect the Parsoid cluster so drastically. I believe there's a bug somewhere.

The request rate is only 40 req/s, and for lang conversion the p99 time is ~5s and the median ~250ms. Regular wt2html requests are much slower (p99 is 50s and median is 1.5s), and we handle 90 req/s of those. So it is pretty baffling that a spike in lang conversion is affecting the Parsoid cluster so adversely.

One thought is that something in the code is not yielding to the event loop properly and locking up workers.

Even so, each server in the cluster runs 40 worker processes and we have 20 servers in the codfw cluster, so that is 800 worker processes. Even if every lang variant request took 5s and locked up one worker, that would only tie up 200 workers at a time, nothing that should cause the cluster to panic.
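That back-of-the-envelope math can be checked with Little's law (in-flight work = arrival rate × time in system), using the numbers from the comment above:

```javascript
// Numbers from the comment above.
const servers = 20;
const workersPerServer = 40;
const totalWorkers = servers * workersPerServer;   // 800 worker processes

// Little's law: in-flight requests = rate * time-in-system.
const reqRate = 40;        // lang variant requests per second
const worstLatency = 5;    // seconds, p99 for lang conversion
const busyWorkers = reqRate * worstLatency;        // 200 workers tied up

console.log(totalWorkers, busyWorkers);            // 800 200
```

So even at worst-case latency, only a quarter of the fleet should be busy with variant transforms.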

So, there is likely a bug here.

Pretty sure the following patch is the reason why Parsoid's maxConcurrentCalls isn't being respected.

https://github.com/rvagg/node-worker-farm/pull/89

That results in an unbounded cpu worker queue, and should be fixed, but isn't necessarily the problem here.
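For clarity, here is a minimal sketch of the semantics that cap is supposed to provide (this is not worker-farm's actual implementation; the names are illustrative): once in-flight calls reach maxConcurrentCalls, new calls should fail fast rather than queue without bound.

```javascript
// Toy bounded farm: fail fast once the cap is reached, instead of
// letting the pending-call queue grow unboundedly.
function makeFarm(maxConcurrentCalls, work) {
  let inFlight = 0;
  return function call(input, cb) {
    if (inFlight >= maxConcurrentCalls) {
      const err = new Error('max concurrent calls reached');
      err.type = 'MaxConcurrentCallsError'; // caller can map this to a 503
      return process.nextTick(cb, err);
    }
    inFlight++;
    work(input, (err, res) => {
      inFlight--;
      cb(err, res);
    });
  };
}

// Demo: cap of 2 with slow work; the third concurrent call is rejected.
const farm = makeFarm(2, (x, cb) => setTimeout(cb, 50, null, x));
let rejected = false;
farm(1, () => {});
farm(2, () => {});
farm(3, (err) => { rejected = err != null; });
```

With the cap enforced, the http worker gets an immediate error it can translate into a 503 instead of the client-side socket timeouts described above.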

Language variant requests aren't processed in the cpu workers; see:
https://github.com/wikimedia/parsoid/blob/master/lib/api/routes.js#L636
https://github.com/wikimedia/parsoid/blob/master/lib/api/apiUtils.js#L586-L637

The current topology is:

1 master
|_ 1 http worker
     |_ cpu worker
     |_ cpu worker
     |_ ... (one per cpu)

Meaning, there are only n (where n is the number of servers, not the number of cpus) processes handling those requests.

We should slightly increase num_workers (currently 1),
https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/templates/config.yaml.j2#L4

And also move language conversion, redlink, and other pb2pb work into lib/parse.js for the cpu workers to handle.
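A hedged sketch of that shape (handleVariantRequest and parseFarm are illustrative names, not Parsoid's actual API): the http worker's route handler only queues the job and translates a queue-full error into a 503, while the conversion itself runs in a cpu worker via lib/parse.js.

```javascript
// Stand-in for the worker-farm call that lib/parse.js requests go
// through; here the "cpu worker" just uppercases the input.
function parseFarm(job, cb) {
  setImmediate(cb, null, { html: job.html.toUpperCase() });
}

// Route-handler-shaped function: queue the job, translate errors.
function handleVariantRequest(req, res) {
  parseFarm({ type: 'variant', html: req.body.html, variant: req.query.variant },
    (err, out) => {
      if (err) return res.status(503).end(); // queue full: fail fast
      res.send(out.html);
    });
}

// Usage with a fake request/response pair.
let sent = null;
handleVariantRequest(
  { body: { html: 'variant me' }, query: { variant: 'zh-hans' } },
  { status: () => ({ end: () => {} }), send: (h) => { sent = h; } }
);
```

The key point is that the http worker never does the conversion inline, so the event loop that accepts requests stays responsive.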

Why is num_workers set to 1 there? I'm reading the git commit message that changed it, but I don't understand the rationale. Speaking of workers, is there a rationale for using worker-farm rather than node's cluster module to spawn more workers?

Why is num_workers set to 1 there? I'm reading the git commit message that changed it, but I don't understand the rationale.

In the description above, num_workers corresponds to the number of "http workers" that are spawned by service-runner. Their job is to accept requests, queue up work for the "cpu workers", and send responses. The idea was that, since they aren't supposed to be handling any business logic, 1 should be sufficient.

Speaking of workers, is there a rationale for using worker-farm rather than node's cluster module to spawn more workers?

So as not to have to rewrite the features of worker-farm in terms of cluster.

I see, that makes sense. Given the current load on the machines, I think it makes sense to bump num_workers to at least 2 or 3.

Change 492757 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid/deploy@master] Bump num_workers to 3

https://gerrit.wikimedia.org/r/492757

Change 492757 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] Bump num_workers to 3

https://gerrit.wikimedia.org/r/492757

Change 492931 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Move language conversion work into lib/parse.js

https://gerrit.wikimedia.org/r/492931

Change 492935 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Use fork of worker-farm

https://gerrit.wikimedia.org/r/492935

Change 492935 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Use fork of worker-farm

https://gerrit.wikimedia.org/r/492935

Change 492931 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move language conversion work into lib/parse.js

https://gerrit.wikimedia.org/r/492931

Change 493283 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Move redlink updating into lib/parse.js

https://gerrit.wikimedia.org/r/493283

Change 493283 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move redlink updating into lib/parse.js

https://gerrit.wikimedia.org/r/493283

For the record, this is what was done from T214099#4981608,

  • Process language variant and other conversion logic in cpu workers
  • Increase the number of http workers per server
  • Fix maxConcurrentCalls in worker-farm