
Scap feature: restart php-fpm on deployment
Closed, ResolvedPublic

Description

I'd like a feature flag that restarts php-fpm on every deployment (sync-world, sync-file, sync-wikiversions, etc.). Turning on this feature would give us a better understanding of how long it takes to roll out MediaWiki if we do a restart (current estimates range from 5 minutes to 15 minutes).

This will help inform whether or not we can do a restart of php-fpm for *every* deployment to mitigate the possibility of opcache corruption.

Event Timeline

thcipriani triaged this task as Medium priority.
thcipriani created this task.

The feature is already there; you just need to have the check-opcache script run unconditionally, for instance by setting php_fpm_opcache_threshold to a very large number. See php-fpm.py in the scap repo.
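To illustrate why a very large threshold amounts to "always restart", here is a minimal sketch. It assumes the check compares free opcache memory against php_fpm_opcache_threshold and restarts when the free amount is below it; the function name and the unit (MB) are hypothetical and are not taken from scap's actual code.

```python
# Hypothetical sketch, NOT scap's implementation: it only illustrates the
# assumed comparison between free opcache memory and the threshold.

def should_restart(free_opcache_mb: float, threshold_mb: float) -> bool:
    """Return True when free opcache memory is below the threshold."""
    return free_opcache_mb < threshold_mb


# With an ordinary threshold, a host with plenty of free opcache is skipped.
normal = should_restart(free_opcache_mb=200, threshold_mb=100)

# With a very large threshold, every host is below it, so the restart
# effectively becomes unconditional.
forced = should_restart(free_opcache_mb=200, threshold_mb=10**9)
```

Under that assumption, no new code path is needed for testing: the existing check simply never finds enough free opcache to skip the restart.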

The only thing we need to actually do is some testing of the feature to ensure:

1 - that it does what we expect
2 - that it doesn't cause user disruption of any kind

Mentioned in SAL (#wikimedia-operations) [2020-10-02T06:51:16Z] <_joe_> restarting php-fpm on all appservers in eqiad, in batches of 10%, for testing the procedure suggested at T264362

For the record, I just performed a rolling restart, using the same command as scap does, in eqiad for the appserver cluster - the largest we have. This took 2 minutes 14 seconds.
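For readers unfamiliar with the "batches of 10%" approach mentioned in the SAL entry, here is a rough sketch of how a host list could be split for a rolling restart. The helper name and host names are hypothetical; this is not the command scap runs, only an illustration of the batching idea.

```python
# Hypothetical sketch of splitting a cluster into ~10% batches for a rolling
# restart. Host names and the helper name are made up for illustration.
from typing import List, Sequence


def split_into_batches(hosts: Sequence[str], percent: int = 10) -> List[List[str]]:
    """Split hosts into consecutive batches of roughly `percent` of the total."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer ceiling division, with a minimum batch size of one host.
    size = max(1, -(-len(hosts) * percent // 100))
    return [list(hosts[i:i + size]) for i in range(0, len(hosts), size)]


example_hosts = [f"appserver{n:02d}" for n in range(1, 26)]  # 25 made-up hosts
example_batches = split_into_batches(example_hosts, percent=10)
```

Each batch would be restarted in turn, so at any moment only about a tenth of the cluster is out of rotation.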

With a couple of changes on the load balancers, that can be sped up significantly: currently pybal waits a grace time of 1 second after changing a value, which is probably way too long anyway. This means that right now the lower limit for a rolling restart of everything is between 5 and 10 minutes. We can halve that time just by reducing that sleep period on the load balancers.

Change 631686 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/debs/pybal@1.15] Reduce reconnectTimeout for etcd to 0.1 seconds

https://gerrit.wikimedia.org/r/631686

We should also keep in mind the error rate and latency during a rolling restart: in eqiad, without traffic, latency increased (up to ~40%) for about 6 minutes. I suspect there were some errors, but not enough to show up in the graphs.


We could do a full restart on codfw (just like we do when we push upgrades for php or php extensions), so we get a rough idea of the user impact we can expect for every code deployment.

Change 631776 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php_fpm: add feature flag to always restart

https://gerrit.wikimedia.org/r/631776

Change 631776 merged by jenkins-bot:
[mediawiki/tools/scap@master] php_fpm: add feature flag to always restart

https://gerrit.wikimedia.org/r/631776

Change 635897 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[mediawiki/tools/scap@master] scap: Always restart php-fpm

https://gerrit.wikimedia.org/r/635897

Change 635897 merged by jenkins-bot:
[mediawiki/tools/scap@master] scap: Add support for ungraceful php-fpm restart

https://gerrit.wikimedia.org/r/635897

Change 631686 merged by jenkins-bot:
[operations/debs/pybal@1.15] Reduce reconnectTimeout for etcd to 0.1 seconds

https://gerrit.wikimedia.org/r/631686

Am I understanding correctly that all the Scap changes for this task have been made now? What remains to be done that prevents us from declaring this task done?

Should we (RelEng) and SRE have a joint meeting to verify the changes to Scap work OK? I'll send an email.

This task appears to be a duplicate of T266055. I'll assume that this task is about adding the feature flag, and that T266055 is its parent, covering the rollout in production and any surrounding tweaks we may need.

Change 658964 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/debs/pybal@1.15-stretch] Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9

https://gerrit.wikimedia.org/r/658964

Change 658964 merged by jenkins-bot:
[operations/debs/pybal@1.15-stretch] Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9

https://gerrit.wikimedia.org/r/658964