Scap feature: restart php-fpm on deployment
Closed, Resolved, Public

Description

I'd like a feature flag that restarts php-fpm on every deployment (sync-world, sync-file, sync-wikiversions, etc). Turning on this feature would give us a better understanding of the time it takes to roll out MediaWiki if we do a restart (current estimates range from 5 minutes to 15 minutes).

This will help inform whether or not we can do a restart of php-fpm for *every* deployment to mitigate the possibility of opcache corruption.

Details

Subject	Repo	Branch	Lines +/-
Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9	operations/debs/pybal	1.15-stretch	+7 -1
Reduce reconnectTimeout for etcd to 0.1 seconds	operations/debs/pybal	1.15	+1 -1
scap: Add support for ungraceful php-fpm restart	mediawiki/tools/scap	master	+52 -8
php_fpm: add feature flag to always restart	mediawiki/tools/scap	master	+26 -1


Related Objects

Status	Assigned	Task
Resolved	Krinkle	T212460 Adopt static array files for local disk storage of values (epic)
Open	None	T99740 Use static php array files for l10n cache at WMF (instead of CDB)
Resolved	Krinkle	T245183 PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.)
Resolved	Krinkle	T253673 Avoid php-opcache corruption in WMF production
Resolved	Joe	T266055 Update Scap to perform rolling restart for all MW deploy
Resolved	thcipriani	T264362 Scap feature: restart php-fpm on deployment

Event Timeline

thcipriani claimed this task. Oct 1 2020, 8:46 PM

thcipriani triaged this task as Medium priority.

thcipriani created this task.

thcipriani moved this task from INBOX to Maintenance on the Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)) board.

brennen added a project: User-brennen. Oct 1 2020, 8:55 PM

brennen moved this task from Backlog to Radar on the User-brennen board.

The feature is already there; you just need to have the check-opcache script run unconditionally, for instance by setting php_fpm_opcache_threshold to a very large number. See php-fpm.py in the scap repo.

The only thing we need to actually do is some testing of the feature to ensure:

1 - that it does what we expect
2 - that it doesn't cause user disruption of any kind
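
As a rough illustration of why a very large threshold makes the restart unconditional, here is a minimal sketch. It assumes the check compares free opcache memory (in MB) against php_fpm_opcache_threshold and restarts when free memory falls below it. The function name and the exact comparison are assumptions made for illustration, not the code in php-fpm.py.

```python
def should_restart(free_opcache_mb: float, threshold_mb: float) -> bool:
    """Return True when php-fpm should be restarted.

    Assumption: a restart is triggered when free opcache memory
    drops below the configured threshold.
    """
    return free_opcache_mb < threshold_mb


# With an ordinary threshold, a server with plenty of free opcache is left alone.
normal = should_restart(free_opcache_mb=300, threshold_mb=100)

# With a threshold larger than any realistic amount of free memory,
# every check ends in a restart.
forced = should_restart(free_opcache_mb=300, threshold_mb=10**9)
```

Under that assumption, raising the threshold is equivalent to an "always restart" switch, without touching the check itself.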

Mentioned in SAL (#wikimedia-operations) [2020-10-02T06:51:16Z] <_joe_> restarting php-fpm on all appservers in eqiad, in batches of 10%, for testing the procedure suggested at T264362

For the record, I just performed a rolling restart, using the same command as scap does, in eqiad for the appserver cluster - the largest we have. This took 2 minutes 14 seconds.

With a couple of changes on the load balancers, that can be sped up significantly: currently pybal waits a grace time of 1 second after changing a value, which is probably way too long anyway. This means that right now the lower limit for a rolling restart of everything is between 5 and 10 minutes. We can halve that time just by reducing that sleep period on the load balancers.
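
To make the effect of the grace time concrete, here is a minimal sketch of a rolling restart split into batches of 10%. It assumes each batch pays the pybal grace time twice (once on depool, once on repool) plus a fixed restart time. The host names, the timing values, the cost model, and the function names are all hypothetical and do not reflect scap's actual implementation.

```python
import math


def make_batches(hosts, fraction):
    """Split hosts into consecutive batches of roughly `fraction` of the pool."""
    size = max(1, math.ceil(len(hosts) * fraction))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def lower_bound_seconds(num_batches, grace_s, restart_s):
    """Lower bound under the assumed model.

    Each batch costs: depool grace + restart + repool grace.
    """
    return num_batches * (2 * grace_s + restart_s)


# 25 hypothetical hosts, restarted in batches of 10% of the pool.
hosts = [f"mw{n:04d}" for n in range(1, 26)]
batches = make_batches(hosts, 0.10)

# Compare a 1 second grace time with a 0.1 second one (restart_s is made up).
slow = lower_bound_seconds(len(batches), grace_s=1.0, restart_s=2.0)
fast = lower_bound_seconds(len(batches), grace_s=0.1, restart_s=2.0)
```

In this model the grace time is a fixed per-batch overhead, so shrinking it lowers the floor on the total duration regardless of how fast php-fpm itself restarts.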

Change 631686 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/debs/pybal@1.15] Reduce reconnectTimeout for etcd to 0.1 seconds

https://gerrit.wikimedia.org/r/631686

gerritbot added a project: Patch-For-Review. Oct 2 2020, 7:08 AM

MoritzMuehlenhoff subscribed. Oct 2 2020, 7:39 AM

We should also keep in mind the error rate and latency during a rolling restart. In eqiad, without traffic, latency increased (up to ~40%) for about 6 minutes. I suspect there were some errors, but not enough to show up in the graphs.

We could do a full restart on codfw (just like we do when we push upgrades for php or php extensions), so we get a rough idea of the user impact we can expect for every code deployment.

Change 631776 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php_fpm: add feature flag to always restart

https://gerrit.wikimedia.org/r/631776

Change 631776 merged by jenkins-bot:
[mediawiki/tools/scap@master] php_fpm: add feature flag to always restart

https://gerrit.wikimedia.org/r/631776

Change 635897 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[mediawiki/tools/scap@master] scap: Always restart php-fpm

https://gerrit.wikimedia.org/r/635897

Change 635897 merged by jenkins-bot:
[mediawiki/tools/scap@master] scap: Add support for ungraceful php-fpm restart

https://gerrit.wikimedia.org/r/635897

Change 631686 merged by jenkins-bot:
[operations/debs/pybal@1.15] Reduce reconnectTimeout for etcd to 0.1 seconds

https://gerrit.wikimedia.org/r/631686

LarsWirzenius subscribed. Dec 21 2020, 2:48 PM

Am I understanding correctly that all the Scap changes for this task have been made now? What remains to be done that prevents us from declaring this task done?

jijiki mentioned this in T245183: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.). Jan 6 2021, 1:54 PM

Should we (RelEng) and SRE have a joint meeting to verify the changes to Scap work OK? I'll send an email.

This task appears to be a duplicate of T266055. I'll assume that this task is about adding the feature flag, and that T266055 is its parent, covering the rollout in production and any surrounding tweaks we may need.

Krinkle mentioned this in T266055: Update Scap to perform rolling restart for all MW deploy. Jan 26 2021, 9:07 PM

Change 658964 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/debs/pybal@1.15-stretch] Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9

https://gerrit.wikimedia.org/r/658964

Change 658964 merged by jenkins-bot:
[operations/debs/pybal@1.15-stretch] Reduce reconnectTimeout for etcd to 0.1 seconds, release 1.15.9