
Parsoid migration to php 7.4
Closed, ResolvedPublic

Description

Given that Parsoid doesn't receive traffic from the public, and given the huge performance gains we've noticed when using PHP 7.4 for parsing, we want to transition Parsoid ASAP.

This would go as follows:

  • First, we'll transition scandium, the Parsoid test server, to PHP 7.4. @ssastry please let us know when that would be acceptable for your team. Ideally, we'd also like you to run a full batch of tests before we move any traffic.
  • Then we'll start installing the new parse1* servers with PHP 7.4 only and progressively add them to the rotation, removing the old servers in the process. This will let us move traffic in 4% chunks (1/24 at a time).
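The 4% figure in the second step is just one server's share of the pool. A minimal sketch of the arithmetic, assuming a 24-server pool with uniform load balancing (the pool size is taken from the 1/24 figure above; everything else is illustrative):

```python
# Sketch: traffic share served by PHP 7.4 as old servers are swapped out,
# assuming a 24-server parsoid pool with uniform load balancing.
POOL_SIZE = 24

def php74_share(swapped: int) -> float:
    """Fraction of traffic on PHP 7.4 after `swapped` servers are replaced."""
    return swapped / POOL_SIZE

print(f"{php74_share(1):.1%}")   # share after one swap (~4%, as in the task)
print(f"{php74_share(24):.1%}")  # share after the full migration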


Event Timeline

Joe triaged this task as High priority. Jul 26 2022, 3:34 PM

You can switch scandium to php 7.4 this week, and we can redo baseline test runs.

But, please give me a heads up before you do that switch so we can make sure not to kick off test runs in that period OR ask you to wait for an additional few hours for any ongoing test run to finish.

Change 817699 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] parsoid::testing: install php 7.4

https://gerrit.wikimedia.org/r/817699

Change 817701 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] parsoid::testing: switch to php 7.4 by default

https://gerrit.wikimedia.org/r/817701

Once we've moved scandium, my plan for parsoid is to install the new hardware we got with PHP 7.4 only, and progressively swap out the old hardware for the new, moving more and more traffic toward PHP 7.4.

Change 817699 merged by Giuseppe Lavagetto:

[operations/puppet@production] parsoid::testing: install php 7.4

https://gerrit.wikimedia.org/r/817699

@subbu would it be ok if we switched to php 7.4 tomorrow, July 28th, at 9:00 UTC?

I have all the patches lined up to do that.

Let us know when it's done and we can kick off an rt-test run to kick the tires.

Change 817701 merged by Giuseppe Lavagetto:

[operations/puppet@production] parsoid::testing: switch to php 7.4 by default

https://gerrit.wikimedia.org/r/817701

@cscott @ssastry it's done now, all requests without an explicit PHP_ENGINE cookie will be routed to php 7.4 on scandium.

Quick summary:

After an initial hiccup during which roundtrip testing was broken for about a week, we got a test run in yesterday. Based on rough estimation (eyeball comparison of the area under the scandium load curve during testing), performance seems to have improved by about 8-10%. This is the total time across about 180K pages (of varying sizes) and includes wt -> html, html -> wt (without selective serialization), and html -> wt (with selective serialization).
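The kind of estimate described above (comparing total parse time for the same corpus across the two runs) reduces to a simple ratio. A sketch with made-up totals chosen to land in the observed range; the numbers are purely illustrative:

```python
# Hypothetical total parse times for the same ~180K-page corpus (illustrative).
php72_total_s = 50_000.0  # total on the old runtime
php74_total_s = 45_500.0  # total on PHP 7.4

improvement = (php72_total_s - php74_total_s) / php72_total_s
print(f"~{improvement:.0%} faster")  # falls in the observed 8-10% range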

There were some warnings in logstash on about 4 pages in the test run that seem to come from Math rendering. So, unless that is somehow related to PHP 7.4, I think the tests themselves are clean. No unexpected regressions, errors, or fatals.

(You will have to zoom into the respective test window period for logstash and grafana).

Overall, everything seems good so far. There is another test run that will probably complete in about 2-3 hours and once that is done, I'll reconfirm.

The new test run completed and the perf is roughly similar. As for the logstash warning, looking at a 3-month window, I see identical warnings from May and June test runs.

So, I think we are good to go with bumping PHP versions on the production cluster.

@Arlolra @cscott. FYI.

Change 827498 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] parsoid: Install parse1* servers with only php 7.4

https://gerrit.wikimedia.org/r/827498

Change 827498 merged by Clément Goubert:

[operations/puppet@production] parsoid: Install parse1* servers with only php 7.4

https://gerrit.wikimedia.org/r/827498

Change 827513 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] parsoid: Add parse1* servers to dsh

https://gerrit.wikimedia.org/r/827513

Change 827513 merged by Clément Goubert:

[operations/puppet@production] parsoid: Add parse1* servers to conftool

https://gerrit.wikimedia.org/r/827513

Mentioned in SAL (#wikimedia-operations) [2022-08-29T16:08:09Z] <claime> pooled parse1001.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

parse1001.eqiad.wmnet pooled in place of wtp1034.eqiad.wmnet
Now serving 4% of parsoid traffic from php7.4 only.

Icinga downtime and Alertmanager silence (ID=9b43b97d-2497-4b6c-848a-987b749a898e) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 24 host(s) and their services with reason: Downtiming php7.4 parsoid servers until they are ready to pool

parse[1001-1024].eqiad.wmnet

Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1001 for host parse1002.eqiad.wmnet with OS buster

Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1001 for host parse1002.eqiad.wmnet with OS buster completed:

  • parse1002 (WARN)
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh buster OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202208310917_cgoubert_297807_parse1002.out
    • Checked BIOS boot parameters are back to normal
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Failed to run httpbb tests
    • Forced a re-check of all Icinga services for the host
    • Icinga status is not optimal, downtime not removed
    • Updated Netbox data from PuppetDB

@Clement_Goubert Just wanted to share something, it's not a problem at all but Icinga monitoring does alert on hosts "not being in dsh groups". This may sound strange at first without context. Example link that currently alerts for parse1002 is:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=parse1002&service=mediawiki-installation+DSH+group

It's called that because back in the day we used https://wikitech.wikimedia.org/wiki/Dsh for scap deployments. The static files with lists of servers were later replaced with https://wikitech.wikimedia.org/wiki/Conftool.

So what it means by "in dsh groups" is actually "it's not pooled in conftool" (does not appear at https://config-master.wikimedia.org/pybal/eqiad/parsoid-php)
but it's also not completely removed yet.

P.S. the cookbook was supposed to set the downtime for that, but that failed for some reason, so nothing you were expected to do differently. I ACKed it manually in Icinga.

P.S. the cookbook was supposed to set the downtime for that, but that failed for some reason, so nothing you were expected to do differently. I ACKed it manually in Icinga.

That's not correct. The cookbooks worked as expected and the downtime was properly set:

Downtimed the new host on Icinga/Alertmanager

And was not removed at the end of the cookbook because Icinga was not green:

Icinga status is not optimal, downtime not removed

It then just expired.

The last item is highlighted in italic in the report (not bold) because it's a warning: some hosts after reimage are, by design, not green until manual intervention.
And the actual failing checks are reported in the console:
And the actual failing checks are reported in the console:

2022-08-31 10:05:14,264 cgoubert 297807 [WARNING] [14/15, retrying in 42.00s] Attempt to run 'spicerack.icinga.IcingaHosts.wait_for_optimal.<locals>.check' raised: Not all services are recovered: parse1002:mediawiki-installation DSH group
[...SNIP...]
2022-08-31 10:05:57,228 cgoubert 297807 [WARNING] //Icinga status is not optimal, downtime not removed//

When that happens, the operator should check the failing checks and act accordingly.

Then the downtime expired 2h after it was set, at 11:33:

Aug 31 11:33:07 alert1001 icinga: HOST DOWNTIME ALERT: parse1002;STOPPED; Host has exited from a period of scheduled downtime

For the record, that failure was expected and Clement was well aware of it.

@Clement_Goubert Just wanted to share something, it's not a problem at all but Icinga monitoring does alert on hosts "not being in dsh groups". This may sound strange at first without context. Example link that currently alerts for parse1002 is:

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=parse1002&service=mediawiki-installation+DSH+group

It's called that because back in the day we used https://wikitech.wikimedia.org/wiki/Dsh for scap deployments. The static files with lists of servers were later replaced with https://wikitech.wikimedia.org/wiki/Conftool.

So what it means by "in dsh groups" is actually "it's not pooled in conftool" (does not appear at https://config-master.wikimedia.org/pybal/eqiad/parsoid-php)
but it's also not completely removed yet.

Clement is aware of all of the above, we just had a downtime expire on us because of T316601. They are working with me on this.

Icinga downtime and Alertmanager silence (ID=e0342c4d-b0d1-490b-b740-0d2962a32ac0) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 1 host(s) and their services with reason: Readding downtime removed by reimage

parse1002.eqiad.wmnet

What happened is: I downtimed all parse1* hosts for a week while we worked on them with @Joe.
The reimage of parse1002 reset that downtime to 2h for that host, which led to Icinga alerting. As mentioned above, I re-added the downtime, and we'll work on adding these hosts to conftool soon.

Change 828786 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/mediawiki-config@master] Update wgLinterSubmitterWhitelist

https://gerrit.wikimedia.org/r/828786

Clement_Goubert changed the task status from Open to In Progress.Sep 1 2022, 9:27 AM
Clement_Goubert changed the status of subtask T307219: Put parse parse10[01-24] in production from Open to In Progress.

Change 828786 merged by jenkins-bot:

[operations/mediawiki-config@master] Update wgLinterSubmitterWhitelist

https://gerrit.wikimedia.org/r/828786

Mentioned in SAL (#wikimedia-operations) [2022-09-01T09:58:30Z] <cgoubert@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:828786|Update wgLinterSubmitterWhitelist (T312638)]] (duration: 03m 37s)

Mentioned in SAL (#wikimedia-operations) [2022-09-01T10:43:05Z] <claime> pooled parse1001.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

Mentioned in SAL (#wikimedia-operations) [2022-09-01T10:58:38Z] <claime> pooled parse1002.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

I will be moving production operations comments to https://phabricator.wikimedia.org/T307219 so as not to clutter this task further.

That's not correct. The cookbooks worked as expected and the downtime was properly set:

Sorry, @Volans. The timing made it look like it on IRC to me. I saw the cookbook run and the alert afterwards. This is even better if there is no issue.

Clement is aware of all of the above, we just had a downtime expire on us because of T316601. They are working with me on this.

ACK. My intention was just to share some info on why it's called dsh and why it has an alert. I definitely did not mean to make it a big deal. The reason I chose the ticket over IRC was to stay async and avoid realtime pings or disrupting work. I'm sorry if that sounded different or like I thought the alert itself was important. Please disregard and carry on.

Currently none of the parse1* hosts is a canary server, is that intended to be changed at some point?

Currently none of the parse1* hosts is a canary server, is that intended to be changed at some point?

I will be working iteratively through the list in conftool-data/node/eqiad.yaml, looping around when I reach the end. When I replace wtp102[5-6].eqiad.wmnet (current canaries), I'll put parse100[1-2] as canaries.
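The iteration described above can be sketched as pairing each old wtp host with its replacement. The host ranges below are abbreviated examples, not the real contents of conftool-data/node/eqiad.yaml:

```python
# Sketch: walk the old wtp list and pool a new parse host for each one removed.
# Hostnames and ranges are illustrative; the real list lives in
# conftool-data/node/eqiad.yaml.
old_hosts = [f"wtp{n}.eqiad.wmnet" for n in range(1025, 1049)]
new_hosts = [f"parse{n}.eqiad.wmnet" for n in range(1001, 1025)]

for old, new in zip(old_hosts, new_hosts):
    # In practice each step is: depool/downtime `old`, pool `new`, watch metrics.
    print(f"replace {old} -> {new}")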

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:16:19Z] <claime> pooled parse1003.eqiad.wmnet (php 7.4 only) in parsoid cluster https://phabricator.wikimedia.org/T312638

Icinga downtime and Alertmanager silence (ID=5982b372-9469-405c-a18d-48d12b854a91) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 3 host(s) and their services with reason: Downtiming replaced wtp servers

wtp[1034-1036].eqiad.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:53:52Z] <claime> pooled parse1004.eqiad.wmnet (php 7.4 only) in parsoid cluster T312638

Mentioned in SAL (#wikimedia-operations) [2022-09-05T12:14:10Z] <claime> depooled wtp1037.eqiad.wmnet from parsoid cluster T312638

Icinga downtime and Alertmanager silence (ID=83cf85ce-8731-463b-9d53-4500611c52ac) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 18 host(s) and their services with reason: Downtime pending inclusion in production

parse[1007-1024].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=3d4522be-4ec5-4d1a-8bba-1d5621e4d400) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 3 host(s) and their services with reason: Downtiming replace wtp servers

wtp[1036-1038].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=adaec891-8daa-470f-9d2f-6c2b62e7f043) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 2 host(s) and their services with reason: Downtiming replaced wtp servers

wtp[1039-1040].eqiad.wmnet

Icinga downtime and Alertmanager silence (ID=bdf6f9e0-2c64-471e-8b85-05f874724182) set by cgoubert@cumin1001 for 7 days, 0:00:00 on 3 host(s) and their services with reason: Downtiming replaced wtp servers

wtp[1041-1043].eqiad.wmnet

100% of parsoid traffic is now served by PHP 7.4.

Old wtp servers now in the hands of DCops for decom.

https://grafana.wikimedia.org/d/000000048/parsoid-timing-wt2html?orgId=1&refresh=30s&from=now-90d&to=now&viewPanel=43 shows a dip in the time per output KB over the last few days which is a good sign.

NOTE: (1) This is a 90-day view. (2) This reports time per output KB, which is affected a bit less by request-rate volume, as long as changes in request rate don't significantly skew the page and page-size distribution. (3) The Y-axis is log10 scale, so the change is more prominent than it appears.
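On point (3): on a log10 Y-axis, a fixed speedup ratio maps to a fixed (and visually small) vertical distance, regardless of absolute times. A quick illustration with hypothetical per-KB timings:

```python
import math

# Hypothetical per-KB parse times before and after the switch (illustrative).
before_ms, after_ms = 100.0, 70.0

ratio = before_ms / after_ms          # a 30% reduction in time
drop_in_decades = math.log10(ratio)   # vertical distance on a log10 axis
print(f"{drop_in_decades:.3f} decades")  # ~0.155: looks small on the plot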

html2wt time per output KB also saw a dip although it is not as prominent as the wt2html one.

But, the p75 html2wt components graph makes it really obvious. This change is for all the components, even the small lines at the bottom (which shows up when you suppress the domdiff and serialize graphs).

Looks like maybe an approximately 30% speedup based on eyeballing the various plots in the full panel.


From the aggregated averages we also collect from apache, I would say the performance gain is between 30 and 35%. It is due to both the new PHP version and the newer hardware.

I am mostly interested in looking at the effect on the number of parsoid timeouts.

https://logstash.wikimedia.org/goto/ad3709f398e3d85115d1d6088fa9e888

Screenshot 2022-09-13 at 12.55.40.png (682×2 px, 83 KB)

Last 30 days of parsoid timeouts, with migration steps annotated. It's still a small sample size, but there seems to be a marked reduction in timeouts after completing the migration.