
Scap feature: restart php-fpm on deployment
Open, Medium, Public

Description

I'd like a feature flag that restarts php-fpm on every deployment (sync-world, sync-file, sync-wikiversions, etc). Turning on this feature would give us a better understanding of the time it takes to roll out MediaWiki if we do a restart (current estimates range from 5 minutes to 15 minutes).

This will help inform whether or not we can do a restart of php-fpm for *every* deployment to mitigate the possibility of opcache corruption.

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani created this task.
brennen moved this task from Backlog to Watching on the User-brennen board.
Joe added a comment. Fri, Oct 2, 6:47 AM

The feature is already there: you just need to have the check-opcache script run unconditionally, for instance by setting php_fpm_opcache_threshold to a very large number. See php-fpm.py in the scap repo.

The only thing we need to actually do is some testing of the feature to ensure:

1 - that it does what we expect
2 - that it doesn't cause user disruption of any kind
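
As a rough illustration of why a very large threshold makes the restart unconditional, here is a minimal sketch. It assumes the check compares free opcache memory (in MB) against php_fpm_opcache_threshold and restarts when free memory is below it; the function name and the always_restart parameter are hypothetical and are not taken from php-fpm.py.

```python
def should_restart(free_opcache_mb: float,
                   threshold_mb: float,
                   always_restart: bool = False) -> bool:
    """Decide whether php-fpm should be restarted (hypothetical helper).

    Assumption: a restart is triggered when free opcache memory drops
    below the configured threshold. An explicit flag short-circuits
    the comparison entirely.
    """
    return always_restart or free_opcache_mb < threshold_mb


# A normal threshold only restarts when opcache is nearly full.
print(should_restart(free_opcache_mb=200, threshold_mb=100))
# A huge threshold is always above the free memory, so it always restarts.
print(should_restart(free_opcache_mb=200, threshold_mb=10**9))
```

An explicit flag would say the same thing more directly than an oversized threshold, at the cost of one more config option.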

Mentioned in SAL (#wikimedia-operations) [2020-10-02T06:51:16Z] <_joe_> restarting php-fpm on all appservers in eqiad, in batches of 10%, for testing the procedure suggested at T264362

Joe added a comment. Fri, Oct 2, 6:57 AM

For the record, I just performed a rolling restart, using the same command as scap does, in eqiad for the appserver cluster - the largest we have. This took 2 minutes 14 seconds.

With a couple of changes on the load balancers, that can be sped up significantly: currently pybal waits a grace time of 1 second after changing a value, which is probably way too long anyway. This means that right now the lower limit for a rolling restart of everything is between 5 and 10 minutes. We can halve that time just by reducing that sleep period on the load balancers.
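
To make the arithmetic concrete, here is a back-of-the-envelope sketch. It assumes a purely serial model in which every pool-state change pays the grace wait once; the function name and the numbers in the example (300 state changes, 300 seconds of other work) are hypothetical and are not measurements from production. Only the 1 second and 0.1 second values come from this discussion.

```python
def rolling_restart_seconds(state_changes: int,
                            grace_s: float,
                            other_s: float) -> float:
    """Estimate total rolling-restart time under a serial model.

    state_changes: number of pool-state changes (hypothetical count)
    grace_s:       wait paid after each change, in seconds
    other_s:       everything else (restarts, checks), in seconds
    """
    return state_changes * grace_s + other_s


# Hypothetical inputs, for illustration only.
current = rolling_restart_seconds(300, grace_s=1.0, other_s=300.0)
reduced = rolling_restart_seconds(300, grace_s=0.1, other_s=300.0)
print(f"current: {current:.0f}s, reduced: {reduced:.0f}s")
```

Under this model the saving scales with the number of state changes, so the real gain depends on how many hosts are depooled and repooled and on how much of that waiting actually happens serially.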

Change 631686 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/debs/pybal@1.15] Reduce reconnectTimeout for etcd to 0.1 seconds

https://gerrit.wikimedia.org/r/631686

We should also keep in mind the error rate and latency during a rolling restart: in eqiad, without traffic, latency increased (up to ~40%) for about 6 minutes. I suspect there were some errors, but not enough to show up in the graphs.

We could do a full restart on codfw (just like we do when we push upgrades for php or php extensions), so we get a rough idea of the user impact we can expect for every code deployment.

Change 631776 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php_fpm: add feature flag to always restart

https://gerrit.wikimedia.org/r/631776

Change 631776 merged by jenkins-bot:
[mediawiki/tools/scap@master] php_fpm: add feature flag to always restart

https://gerrit.wikimedia.org/r/631776

Change 635897 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[mediawiki/tools/scap@master] scap: Always restart php-fpm

https://gerrit.wikimedia.org/r/635897