Stress test Parsoid's HTTP API
Closed, ResolvedPublic

Description

In T204624 (and maybe T198421), we saw a failure scenario where, under a higher than expected request rate, Parsoid continued to fulfill some requests but did not respond with a 503 when its queues were presumably full, and instead caused socket timeouts on the client side.

That task was closed after tuning change prop for the beta cluster, but without addressing the issue in Parsoid.

Yesterday's production incident looked somewhat similar, so here we are again.

Event Timeline

Parsoid was in a bit of trouble again today. At 02:44, XioNoX depooled eqsin, so all the mobile traffic for the Chinese Wikipedia started hitting RESTBase. Since Chinese has language variants, RB started asking Parsoid to transform HTML into the correct variant. As the transformation request rate reached roughly 40 req/s, Parsoid started struggling, alerting and timing out.

I would like to prioritize this issue: even though node Parsoid is being replaced, depooling an edge CDN DC should not be able to bring down our infrastructure. Also, 40 req/s of lang variant transforms seems quite low to affect the Parsoid cluster so drastically. I believe there's a bug somewhere.

The request rate is only 40 req/s, and for lang conversion the p99 time is ~5s and the median ~250ms. Regular wt2html requests are much slower (p99 is 50s and median is 1.5s), and we handle 90 req/s of those. So it is pretty baffling that a spike in lang conversion is affecting the Parsoid cluster so adversely.

One thought is that something in the code is not yielding to the event loop properly and locking up workers.

Even so, each server in the cluster runs 40 worker processes and we have 20 servers in the codfw cluster, so that is 800 worker processes. Even if every lang variant request took 5s and locked up one worker, that would only tie up 200 workers at a time, nothing that should cause the cluster to panic.
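That back-of-the-envelope math can be checked with Little's law (in-flight work = arrival rate × time in system), using the numbers from the comment above:

```javascript
// Numbers from the comment above.
const servers = 20;
const workersPerServer = 40;
const totalWorkers = servers * workersPerServer;   // 800 worker processes

// Little's law: in-flight requests = rate * time-in-system.
const reqRate = 40;        // lang variant requests per second
const worstLatency = 5;    // seconds, p99 for lang conversion
const busyWorkers = reqRate * worstLatency;        // 200 workers tied up

console.log(totalWorkers, busyWorkers);            // 800 200
```

So even at worst-case latency, only a quarter of the fleet should be busy with variant transforms.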

So, there is likely a bug here.

Pretty sure the following patch is the reason why Parsoid's maxConcurrentCalls isn't being respected.

https://github.com/rvagg/node-worker-farm/pull/89

That results in an unbounded cpu worker queue, and should be fixed, but isn't necessarily the problem here.
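For clarity, here is a minimal sketch of the semantics that cap is supposed to provide (this is not worker-farm's actual implementation; the names are illustrative): once in-flight calls reach maxConcurrentCalls, new calls should fail fast rather than queue without bound.

```javascript
// Toy bounded farm: fail fast once the cap is reached, instead of
// letting the pending-call queue grow unboundedly.
function makeFarm(maxConcurrentCalls, work) {
  let inFlight = 0;
  return function call(input, cb) {
    if (inFlight >= maxConcurrentCalls) {
      const err = new Error('max concurrent calls reached');
      err.type = 'MaxConcurrentCallsError'; // caller can map this to a 503
      return process.nextTick(cb, err);
    }
    inFlight++;
    work(input, (err, res) => {
      inFlight--;
      cb(err, res);
    });
  };
}

// Demo: cap of 2 with slow work; the third concurrent call is rejected.
const farm = makeFarm(2, (x, cb) => setTimeout(cb, 50, null, x));
let rejected = false;
farm(1, () => {});
farm(2, () => {});
farm(3, (err) => { rejected = err != null; });
```

With the cap enforced, the http worker gets an immediate error it can translate into a 503 instead of the client-side socket timeouts described above.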

Language variant requests aren't processed in the cpu workers; see:
https://github.com/wikimedia/parsoid/blob/master/lib/api/routes.js#L636
https://github.com/wikimedia/parsoid/blob/master/lib/api/apiUtils.js#L586-L637

The current topology is:

1 master
|_ 1 http worker
     |_ cpu worker
     |_ cpu worker
     |_ ... (one per cpu)

Meaning, there are only n (where n is the number of servers, not the number of cpus) processes handling those requests.

We should slightly increase num_workers (currently 1),
https://github.com/wikimedia/mediawiki-services-parsoid-deploy/blob/master/scap/templates/config.yaml.j2#L4

And also move language conversion, redlink, and other pb2pb work into lib/parse.js for the cpu workers to handle.
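A hedged sketch of that shape (handleVariantRequest and parseFarm are illustrative names, not Parsoid's actual API): the http worker's route handler only queues the job and translates a queue-full error into a 503, while the conversion itself runs in a cpu worker via lib/parse.js.

```javascript
// Stand-in for the worker-farm call that lib/parse.js requests go
// through; here the "cpu worker" just uppercases the input.
function parseFarm(job, cb) {
  setImmediate(cb, null, { html: job.html.toUpperCase() });
}

// Route-handler-shaped function: queue the job, translate errors.
function handleVariantRequest(req, res) {
  parseFarm({ type: 'variant', html: req.body.html, variant: req.query.variant },
    (err, out) => {
      if (err) return res.status(503).end(); // queue full: fail fast
      res.send(out.html);
    });
}

// Usage with a fake request/response pair.
let sent = null;
handleVariantRequest(
  { body: { html: 'variant me' }, query: { variant: 'zh-hans' } },
  { status: () => ({ end: () => {} }), send: (h) => { sent = h; } }
);
```

The key point is that the http worker never does the conversion inline, so the event loop that accepts requests stays responsive.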

Why is num_workers set to 1 there? I'm reading the git commit message that changed it, but I don't understand the rationale. Speaking of workers, is there a rationale for using worker-farm rather than node's cluster module to spawn more workers?

Why is num_workers set to 1 there? I'm reading the git commit message that changed it, but I don't understand the rationale.

In the description above, num_workers corresponds to the number of "http workers" that are spawned by service-runner. Their job is to accept requests, queue up work for the "cpu workers", and send responses. The idea was that, since they aren't supposed to be handling any business logic, 1 should be sufficient.

Speaking of workers, is there a rationale for using worker-farm rather than node's cluster module to spawn more workers?

So as not to have to rewrite the features of worker-farm in terms of cluster.

I see, that makes sense. Given the current load on the machines, I think it makes sense to bump num_workers to at least 2 or 3.

Change 492757 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid/deploy@master] Bump num_workers to 3

https://gerrit.wikimedia.org/r/492757

Change 492757 merged by jenkins-bot:
[mediawiki/services/parsoid/deploy@master] Bump num_workers to 3

https://gerrit.wikimedia.org/r/492757

Change 492931 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Move language conversion work into lib/parse.js

https://gerrit.wikimedia.org/r/492931

Change 492935 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Use fork of worker-farm

https://gerrit.wikimedia.org/r/492935

Change 492935 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Use fork of worker-farm

https://gerrit.wikimedia.org/r/492935

Change 492931 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move language conversion work into lib/parse.js

https://gerrit.wikimedia.org/r/492931

Change 493283 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] Move redlink updating into lib/parse.js

https://gerrit.wikimedia.org/r/493283

Change 493283 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Move redlink updating into lib/parse.js

https://gerrit.wikimedia.org/r/493283

For the record, this is what was done from T214099#4981608,

  • Process language variant and other conversion logic in cpu workers
  • Increase the number of http workers per server
  • Fix maxConcurrentCalls in worker-farm