Parsoid/PHP performance benchmarking on scandium / eqiad cluster
Closed, Resolved · Public

Description

Pick a subset of titles from different wikis. Write a script to hit Parsoid/JS and Parsoid/PHP for all these titles and run performance benchmarks for different modes (wt -> html, html -> wt, wt -> pagebundle, pagebundle -> wt, maybe selser with some random edits). The script might use other tools, such as ab.

This will give us a more solid measure of how Parsoid/JS and Parsoid/PHP performance compare.

Details

Related Gerrit Patches:
mediawiki/services/parsoid : master · Add a benchmarking script

Event Timeline

ssastry created this task. · Sep 6 2019, 4:11 AM
Restricted Application added a subscriber: Aklapper. · Sep 6 2019, 4:11 AM
ssastry triaged this task as Medium priority. · Sep 6 2019, 4:11 AM
ssastry moved this task from Backlog to Performance on the Parsoid-PHP board.
Krinkle claimed this task. · Sep 30 2019, 8:02 PM
Krinkle moved this task from Inbox to Doing on the Performance-Team board.
Gilles moved this task from Doing to Radar on the Performance-Team board. · Sep 30 2019, 8:02 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.

@ssastry Do you have a workload of titles or urls in mind already? If not, one or multiple of the following might be worth considering:

  1. If we want a relatively small set of "high priority" urls of which we could demand little to no impact, perhaps use large.dblist in combination with AQS to get the top 200 articles of the last month or something like that.
  2. If we want a diverse set of titles, to catch more rare issues, perhaps use open.dblist (expand with expanddblist from wmf-config or from mwmaint), and query e.g. 500 random content-namespace pages from the API.
  3. If we want a representative workload from prod, another way might be to replay traffic. E.g. get 1 hour's worth of HEAD+GET requests from a single app server (I don't think we have local logs anymore, but this can be captured from varnishlog by looking at backend requests only). Then extract things like title and oldid from the urls to create the dataset. Perhaps API titles/revids/pageids as well.
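
A minimal sketch of the "extract title and oldid from urls" step in option 3. It assumes the captured requests use the usual MediaWiki URL shapes (`/wiki/<title>` and `/w/index.php?title=...&oldid=...`); the function name `extract_page_ref` is hypothetical, and real captured traffic may need more cases than this handles.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def extract_page_ref(url):
    """Return (title, oldid) for a MediaWiki-style URL; either may be None."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)

    # Prefer an explicit ?title= parameter, else fall back to /wiki/<title>.
    title = query.get("title", [None])[0]
    if title is None and parts.path.startswith("/wiki/"):
        title = unquote(parts.path[len("/wiki/"):]) or None

    # oldid is optional; ignore values that are not integers.
    oldid = None
    raw_oldid = query.get("oldid", [None])[0]
    if raw_oldid is not None:
        try:
            oldid = int(raw_oldid)
        except ValueError:
            oldid = None

    return title, oldid
```

Deduplicating the resulting (title, oldid) pairs would give a dataset that can be replayed against both endpoints.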

We have a bunch of scripts in https://github.com/wikimedia/mediawiki-services-parsoid-testreduce/tree/master/server/scripts that we've used to generate large samples of production test pages based on source wikis, desired proportion of random pages and recent edits, etc. Roundtrip testing, visual diff testing, tidy-remex testing, were all populated with those scripts.

I was thinking of using those, but I need to assess whether those scripts meet our needs or whether we should do something like what you suggest here.

Krinkle removed Krinkle as the assignee of this task. · Oct 1 2019, 1:57 AM
ssastry added a subscriber: Gilles. · Oct 1 2019, 9:51 PM

@Krinkle, @Gilles The REST API endpoints won't be available publicly, at least initially. So, what benchmarking tools should we use on scandium (or one of the wtp production servers once they become available for testing)? ab is readily available, but it is a tad slow and would also need to be wrapped in a script to do the kind of benchmarking we want (multiple concurrent requests from a big set of urls). We can script it, but if there is a way to avoid that by using some other tool, that would be ideal.

> with multiple concurrent requests from a big set of urls

So you want the tool to hit a bunch of different URLs concurrently at the same time? Is the idea that the requests would be fed in the same order and at the same pace as the real world, and seeing how both endpoints do in terms of average total response latency?

Yes, that is the idea for one of the benchmarking modes: not necessarily the same order, but at least roughly the same request rate and some random mix of urls.

https://grafana.wikimedia.org/d/000000048/parsoid-timing-wt2html?panelId=37&fullscreen&orgId=1&from=now-30d&to=now shows that the reparse rate is ~90 req/s across the codfw cluster of 20 servers, which gives an average of 4.5 req/s per server. We will probably round that up to 5 req/s to account for the other two request types.
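
As a quick check of that arithmetic, here is a small sketch that derives the per-server rate and the matching gap between requests for a load generator. The variable names are made up for illustration; the only inputs are the ~90 req/s and 20 servers quoted above.

```python
import math

cluster_rate = 90   # approximate reparse requests per second across the cluster
servers = 20        # servers in the cluster

per_server_rate = cluster_rate / servers    # average load per server
target_rate = math.ceil(per_server_rate)    # rounded up to leave headroom
interval = 1.0 / target_rate                # seconds between requests at that rate

print(per_server_rate, target_rate, interval)  # prints: 4.5 5 0.2
```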

We'll also have a production cluster server depooled to run the benchmarks on, since we'll likely run both Parsoid/JS & Parsoid/PHP at the same time on all the servers for a few weeks to make sure all is well. So, this benchmarking will also tell us whether the servers can sustain that dual load or whether we need to wait to commission new servers first.

Gilles added a comment. · Oct 3 2019, 1:58 PM

Have you considered shadow dual-serving the requests from real production traffic? This is what I did during the Thumbor rollout. Essentially direct the requests coming from real traffic both to the usual stack and to the new one as well. Parsoid/PHP would simply save its results in a different place (but really save it, to keep things realistic). For Thumbor it was using a URL storage prefix in Swift that we were able to easily delete once those tests were done.

It can be merely a portion of the real traffic. And you'd instrument both sides to compare. Ideally you want this to work with the same amount of requests going to both servers, and the servers having similar specs.

It really doesn't get any closer to what things will be really like once you actually migrate than that.

This is also a good way to discover bugs, by comparing output. I understand that the output might not be identical, but if you flag things when the output is say > 5% different, you might find interesting bugs before rolling out to production.
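
One way to sketch the "flag when the output is > 5% different" idea, using Python's standard `difflib`. This is only an illustration: the similarity measure, the helper names and the choice to compare raw strings are all assumptions, and real HTML would likely need normalization first. `SequenceMatcher` can also be slow on very large documents.

```python
from difflib import SequenceMatcher


def difference_ratio(old_output, new_output):
    """Return a value in [0, 1]: 0.0 for identical strings, 1.0 for nothing in common."""
    matcher = SequenceMatcher(None, old_output, new_output, autojunk=False)
    return 1.0 - matcher.ratio()


def should_flag(old_output, new_output, threshold=0.05):
    """Flag a pair of outputs whose difference exceeds the threshold (5% by default)."""
    return difference_ratio(old_output, new_output) > threshold
```

Pairs that get flagged could then be logged with the title and revision so that they can be inspected by hand.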

That's what let me fix most of the bugs in Thumbor before it started really serving production traffic. In fact I didn't push this far enough, as I was only comparing HTTP status and response size. But some bugs that were found once in production were more subtle than that (eg. messed up colors), and I should have gone as far as making a visual comparison, or at least a histogram one. Similar ideas might be applicable here. In the end I expected that phase to be a formality before the production deployment and actually ended up spending a couple of months fixing all the bugs that the shadow dual serving surfaced. And when the actual rollout happened, the launch was a non-event, as 99% of the bugs when exposed to real traffic had been fixed already.

Mirroring real production traffic is indeed the idea (T229019), for the reasons you mention: to identify any additional crashers not uncovered by our regular rt-testing on scandium (with ~150K production pages). Changeprop traffic is not live client traffic (edit requests from VE & CX); it is reparse traffic, which makes up >90% of all Parsoid requests.

But, the question we want to answer here is: can we use the existing Parsoid/JS cluster to serve both Parsoid/JS & Parsoid/PHP requests without needing to wait on new servers? It seems like we should be able to since codfw cluster has < 20% load but we want to verify that hunch via benchmarking.

As for HTML comparison, that is also in the works (by implementing normalizations for things that are known to differ but don't matter), but we plan to do that on scandium only, not with production traffic. We think the sample size on scandium is big enough to give us sufficient confidence. Alternatively, we might do a visual diff run comparing renderings instead. That is something we'll resolve by next week.

Gilles added a comment. · Oct 3 2019, 5:47 PM

If colocated, just start with 1% of the traffic being dual-served to the new system and ramp it up gradually? If the worry is that the new service would deplete resources on the machine.

With systemd you should be able to peg some cores to either service. That might be a way to ensure that they roughly have the same dedicated resources while being colocated, and that the OS isn't doing something funny with scheduling between the 2 that would give an advantage to one or the other. And more importantly, it protects you against the new service potentially going crazy and overwhelming the CPU, as it would only max out one core at most.

cscott added a subscriber: cscott. · Oct 3 2019, 5:57 PM

What about splitting the hosts, and directing JS traffic to one half of them and PHP traffic to the other half? Then by changing exactly what machines are in each pool we can reallocate resources as necessary, since every machine would be capable of responding to either PHP or JS requests.

Gilles added a comment. · Edited · Oct 3 2019, 6:05 PM

That's what I suggested initially. If the machines are used at 20% of their capacity, it does sound like you could afford to split the cluster in half.

> If colocated, just start with 1% of the traffic being dual-served to the new system and ramp it up gradually? If the worry is that the new service would deplete resources on the machine.

That was the plan when we started sending live *edit* client traffic to the clusters, but that was from a correctness POV so we verify correctness before increasing traffic. But, we could possibly do that with this initial shadow deployment as well. Depends what is involved in changeprop to enable that. @Pchelolo might know.

> With systemd you should be able to peg some cores to either service. That might be a way to ensure that they roughly have the same dedicated resources while being colocated, and that the OS isn't doing something funny with scheduling between the 2 that would give an advantage to one or the other. And more importantly, it protects you against the new service potentially going crazy and overwhelming the CPU, as it would only max out one core at most.

I'll check with SRE about that.

> What about splitting the hosts, and directing JS traffic to one half of them and PHP traffic to the other half? Then by changing exactly what machines are in each pool we can reallocate resources as necessary, since every machine would be capable of responding to either PHP or JS requests.

That was the very first proposal, but that was not favored by @mobrovac and SRE, afaik. @Dzahn can clarify.

>> If colocated, just start with 1% of the traffic being dual-served to the new system and ramp it up gradually? If the worry is that the new service would deplete resources on the machine.
>
> That was the plan when we started sending live *edit* client traffic to the clusters, but that was from a correctness POV so we verify correctness before increasing traffic. But, we could possibly do that with this initial shadow deployment as well. Depends what is involved in changeprop to enable that. @Pchelolo might know.

This would be better done in RESTBase. We could easily mirror the traffic there and simply discard the value given by Parsoid/PHP (in a first iteration). And it's easy to control the percentage that way too.
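
To illustrate how a mirroring percentage could be controlled, here is a minimal sketch of deterministic percentage sampling. It is not RESTBase code: the function name `should_mirror` and the idea of hashing a request key (for example a title plus revision id) are assumptions made purely for this example.

```python
import hashlib


def should_mirror(request_key, percent):
    """Decide whether a request is also sent to the shadow backend.

    request_key: any stable string identifying the request (hypothetical).
    percent: share of traffic to mirror, from 0 to 100.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets so that 0.01% steps are possible.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Hashing rather than drawing a random number means the same request key is always either mirrored or not, so raising the percentage only ever adds requests to the mirrored set.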

>> With systemd you should be able to peg some cores to either service. That might be a way to ensure that they roughly have the same dedicated resources while being colocated, and that the OS isn't doing something funny with scheduling between the 2 that would give an advantage to one or the other. And more importantly, it protects you against the new service potentially going crazy and overwhelming the CPU, as it would only max out one core at most.
>
> I'll check with SRE about that.

Heh, this is php-fpm, so it's a bit trickier. But, I don't think load will ultimately be a problem. First, we need to assess the correctness of Parsoid/PHP, which means sending a small percentage of live traffic to it. Then, we will gradually be switching wikis, which means that the load will increase for Parsoid/PHP but decrease for Parsoid/JS.

>> What about splitting the hosts, and directing JS traffic to one half of them and PHP traffic to the other half? Then by changing exactly what machines are in each pool we can reallocate resources as necessary, since every machine would be capable of responding to either PHP or JS requests.
>
> That was the very first proposal, but that was not favored by @mobrovac and SRE, afaik. @Dzahn can clarify.

Assuming what I wrote is correct wrt the migration path, then it's easier to do the switch if they are collocated: puppet-wise, we'll have uniformity across all wtp* hosts, while also not needing to switch LVS back-ends every time a host is converted. Instead, we'd keep both variants on the hosts (with full LVS lists), and then just drop the superfluous puppet classes and LVS once we are sure Parsoid/PHP is doing well enough to be the only variant. If the latter is not the case, then we'd already have Parsoid/JS' full power set up on the hosts to switch back to it in case we need to do so.

Change 544066 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Add a benchmarking script

https://gerrit.wikimedia.org/r/544066

ssastry renamed this task from Parsoid/PHP performance benchmarking on scandium to Parsoid/PHP performance benchmarking on scandium / eqiad cluster. · Oct 21 2019, 3:57 PM
ssastry closed this task as Resolved. · Edited · Oct 21 2019, 4:07 PM
ssastry claimed this task.

It turns out that, given the active-passive MediaWiki setup, Parsoid/PHP will run in the active cluster (eqiad) while Parsoid/JS runs in codfw. So, we needn't worry about splitting the cluster.

With https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/544066, I ran benchmarks on both scandium and wtp1025. Initially, I picked a set of 25 predefined titles and issued 50C requests, where C is the concurrency level, for C = 2, 4, 6, 8, 10, 12, 14, 16. This revealed that Parsoid/PHP mean response times were consistently better than those of Parsoid/JS. As concurrency went up, the gap between the two narrowed, but Parsoid/PHP remained faster.

Later on, I grabbed a 20K wt2html request snapshot from a 90-min window on wtp2001. I then issued 100C requests (randomly sampled from the snapshot), where C is the concurrency level, for C = 4, 5, 6, 7, 8, 9, 10, 12, 14, 16. I also ran another test with 6000 requests at a concurrency level of 6.
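
The shape of such a run can be sketched as follows. This is not the benchmarking script from the Gerrit change, only an illustration under stated assumptions: "100C" is read as 100 × C requests, `fetch` is a caller-supplied function that performs one request, and the name `run_level` is made up for this example.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(fetch, url):
    """Return the wall-clock seconds taken by one fetch(url) call."""
    start = time.perf_counter()
    fetch(url)
    return time.perf_counter() - start


def run_level(fetch, urls, concurrency, multiplier=100, seed=0):
    """Issue multiplier * concurrency randomly sampled requests, concurrency at a time."""
    rng = random.Random(seed)  # fixed seed so both backends see the same sample
    sample = [rng.choice(urls) for _ in range(multiplier * concurrency)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(lambda url: timed_call(fetch, url), sample))
    return {
        "requests": len(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
    }
```

Calling `run_level` once per backend with the same `urls` and `seed`, for each value of C, and then dividing one backend's median (or mean) by the other's gives a relative figure for that concurrency level. A `fetch` built on `urllib.request.urlopen(url).read()` would be one simple choice.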

On wtp1025, across the board, median response time for Parsoid/PHP was 0.5x that of Parsoid/JS with minor variations. The mean response time gap narrowed as the value of C increased, but Parsoid/PHP mean time was 0.55x - 0.72x of Parsoid/JS.

We will still need to benchmark html->wt (with and without selective serialization) and html->html requests. Those are more involved and will be done separately. T232222 already indicates that the picture for html->wt will be different compared to wt->html. But, for the purposes of initial shadow deploy where wt->html performance was the biggest concern, I consider this task done.

Change 544066 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Add a benchmarking script

https://gerrit.wikimedia.org/r/544066

DannyS712 added a subscriber: DannyS712.

[batch] remove patch for review tag from resolved tasks