Page MenuHomePhabricator

Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers
Closed, ResolvedPublic

Description

If there is a bad patch deployed that needs to be reverted immediately, the revert is more important than the checks for php-fpm's cache ensuring server capacity for serving... HTTP 500 errors.

Original plan: Using the flag should Just Work™ and deploy as fast as possible for all scap sync-* commands.

Updated plan, per T243009#6527973, when the flag is set, we should do an immediate local restart of php-fpm, instead of the graceful depool/restart/repool script. It should use the same rolling batch mechanism as the regular restart, which should be fast enough.

Event Timeline

thcipriani moved this task from Needs triage to Debt on the Scap board.

Does the "maybe restart php-fpm" step take a noticable amount of time? I thought it was more or less instantaneous.

In any event, I'm not supportive of it being skipped because doing so makes our very fragile PHP7 deployment even harder to reason about or trust.

Our PHP7 deployment is hanging by a very thin thread due to the ability for opcache to get corrupted in way that it does not detect. See T224491 for examples. We have been very fortunate so far that the random "off by one" corruptions it has performed on primitive values such as string and bools, hasn't yet (as far as we know) caused loss of data or private data to become exposed but... I have absolutely no reason to assume that it won't lead to that. If we had come to this realisation earlier, I would've even recommended we stay on HHVM until this is fixed in better ways that we have done so far.

Note that despite our best efforts (e.g. predictive php-fpm restarts before we think opcache is about to shit itself...), we are still seeing occasional corruption from time to time (T245183). This wouldn't be too bad if it simply caused a segfault or other sane and fast failure, but as shown above that is not the case.

I understand that if a bad deploy happened we want to recover quickly, but can we quantify the number of seconds gained by skipping this (automated) step?

Does the "maybe restart php-fpm" step take a noticable amount of time? I thought it was more or less instantaneous.

The actual restart may be quick, but depooling the server prior to restart via conftool is slow.

How this works, currently:

  1. scap calls /usr/local/sbin/check-and-restart-php
  2. This is a script that calls the php7adm's opcache-info endpoint to dump the current size of the cache and determine if it's over the threshold in scap configuration
  3. Based on scap configuration, this script may decide to restart php, it does so via /usr/local/sbin/restart-php7.2-fpm which calls out to conftool's safe-service-restart.py script
  4. This script updates pybal and queries conftool in a loop until the update propagates to ensure that a service is depooled. Due to some limitation in pybal this is the slow step.

Additionally, we can't depool all of one type of server at once, so there are batches of server types that get depooled together. When we looked at this as part of the php7 upgrade (see T224857) I estimated that a sync-file would take ~11 minutes (instead of ~1minute) if we restarted all php-fpm for all appservers.

I have been persuaded that we need to restart php-fpm for every deployment (cf: T236104#5920394). However slow deployments become we need the ability to rollback fast.

The flip-side of cache corruption that may cause availability problems, data corruption, or security vulnerabilities is the need to rollback a change that is causing availability problems, data corruption, or security vulnerabilities -- that's the problem we're trying to solve with this change.

@thcipriani @LarsWirzenius would it be too much if we'd keep the current defaults, and introduce a flag like --skip-php-restart or --restart-php ?

RE: Having an option to take a shortcut, […] not restarting equates to effectively not having deployed any code.

However if we want to have a shortcut that doesn't do things in a rolling way, but immediately […] e.g. if most requests are http 5xx anyway, and if not having appserver capacity is handled gracefully higher in the traffic stack, then maybe an option for a bigger more risky batch could make sense. […]

I checked with SRE, and it looks like should not re-purpose restart-php7.2-fpm and safe-service-restart.py here. Doing so without batches, would result in more than 50% capacity being depooled. While that's conceptually expected here and should not be an issue, it becomes an issue because the PyBall loadbalancer between Varnish and MW will ignore pooling/weights if there is less than 50% capacity, which might cause requests to be sent anyway and not fail in a fast way.

Instead, it might make more sense to do restarts directly without any depool/repool. This only takes a second and would make the restarts succeed even quicker. In terms of failure mode during this, PyBall will notice the state checks passively very quickly through its healthcheck failing to respond within its tightly configured timeout.

Open questions (I guess for @Joe):

  • Would "simply" restarting php-fpm directly abort on-going requests? I assume so, and we're fine with this. Just checking so that we can document this as expected.
  • What would happen to requests routed to an appserver during the restart? Fail fast? If not, is there something at hand that you'd like to configure differently to make such reqs fail fast?
  • Is it worth putting some kind of very crude batching or rolling into this forced/emergency restart mode? Or is fanning out everywhere as fast as we can fine?

Open questions (I guess for @Joe):

  • Would "simply" restarting php-fpm directly abort on-going requests? I assume so, and we're fine with this. Just checking so that we can document this as expected.

Yes, restarting php-fpm will kill all in-flight requests

  • What would happen to requests routed to an appserver during the restart? Fail fast? If not, is there something at hand that you'd like to configure differently to make such reqs fail fast?

Yes, it would fail fast, and unless it's a POST, it will be retried by ATS.

  • Is it worth putting some kind of very crude batching or rolling into this forced/emergency restart mode? Or is fanning out everywhere as fast as we can fine?

I think the same batching we maintain for the non-force case should be fine.

The only difference is in that case we'd just have to run a different command, but that might need some permissions work.

Krinkle renamed this task from Make scap skip restarting php-fpm when using --force to Add option in Scap to skip restarting php-fpm for emergency deployments.Oct 20 2020, 5:36 PM
Krinkle updated the task description. (Show Details)

Open questions (I guess for @Joe):

  • Would "simply" restarting php-fpm directly abort on-going requests? I assume so, and we're fine with this. Just checking so that we can document this as expected.

Yes, restarting php-fpm will kill all in-flight requests

If the prod servers are doing 5k rps and restart takes 10 seconds to restart we drop some appreciable fraction (say half) of those requests during that window -- is that a reasonable way to think about this? 25K 500 errors every rollback?

If so, this might be a hard choice for deployers.

An example would be: logs for a PHP NOTICE are spiking following a deploy to group0 user impact is unknown -- would we do a fast rollback and serving 25K 500s on all wikis to resolve a NOTICE affecting a small fraction of users (and maybe not impacting them at all). This would be the calculus we'd need to do several times a week.

Proposal:

Revise /usr/local/bin/safe-service-restart to accept --force to make it skip depool/repool.

Revise /usr/local/sbin/check-and-restart-php to accept --force and
pass it along to /usr/local/bin/safe-service-restart as needed.

scap: when scap sync or scap sync-world is passed --force, pass
--force to php_fpm_restart_script (aka /usr/local/sbin/check-and-restart-php).

Change 635630 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/puppet@production] Add --force flag to safe-service-restart.py

https://gerrit.wikimedia.org/r/635630

Change 635635 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/puppet@production] Add --force flag to php-check-and-restart.sh

https://gerrit.wikimedia.org/r/635635

Open questions (I guess for @Joe):

  • Would "simply" restarting php-fpm directly abort on-going requests? I assume so, and we're fine with this. Just checking so that we can document this as expected.

Yes, restarting php-fpm will kill all in-flight requests

If the prod servers are doing 5k rps and restart takes 10 seconds to restart we drop some appreciable fraction (say half) of those requests during that window -- is that a reasonable way to think about this? 25K 500 errors every rollback?

I haven't checked the numbers, but yes, to me this seems acceptable given it's not only about how many requests we lose due to the restart, it's also how many we don't lose by recovering sooner rather than later.

If the reason for rollback isn't already causing e.g. many edits or page views to be lost, broken or corrupted, and possibly emitting HTTP 500, then there'd be no reason for a "emergency"-forced rollback, just a regular one.

If so, this might be a hard choice for deployers.

An example would be: logs for a PHP NOTICE are spiking following a deploy to group0 user impact is unknown -- would we do a fast rollback and serving 25K 500s on all wikis to resolve a NOTICE affecting a small fraction of users (and maybe not impacting them at all). This would be the calculus we'd need to do several times a week.

No, I don't think so. That's not an emergency-level rollback imho.

I would also recommend not calling it --force since that imho doesn't represent well what is happening. In the original plan we proposed that the emergency option would skip restarts (and thus, with opcache renewals disabled, this means some servers will keep using the old code). In the new plan, we're proposing to hard-restart without delay. The main impact of this is: reduced capacity, and aborting on-going requests. I may be biased by unrelated tools like git and rm, but I feel like --force sounds too casual and innocent for what this does. E.g. I can see someone rolling out a patch normally, not seeing the result, and thus doing it again with force "just in case". That'd be Bad™. Perhaps --emergency or --kill would be more appropiately scary.

--force sounds too casual and innocent for what this does. E.g. I can see someone rolling out a patch normally, not seeing the result, and thus doing it again with force "just in case". That'd be Bad™. Perhaps --emergency or --kill would be more appropiately scary.

Agreed. I was considering --unsafe.

--force sounds too casual and innocent for what this does. E.g. I can see someone rolling out a patch normally, not seeing the result, and thus doing it again with force "just in case". That'd be Bad™. Perhaps --emergency or --kill would be more appropiately scary.

Agreed. I was considering --unsafe.

I should clarify that my agreement extends to the two restart scripts.

scap sync already has --force to disable deployment safety nets.

We have these choices:

  • Extend scap sync --force's documented behavior to include unsafe restart and actively spread the news about this. This is what I prefer.
  • Rename --force to --emergency or whatever is decided. I've had the fortune of not having to use --force yet, so I don't have strong feelings about this.
  • Require an additional flag to request unsafe restart. I don't think anyone really wants this.

Proposal:

Revise /usr/local/bin/safe-service-restart to accept --force to make it skip depool/repool.

I think there is a simpler way to implement this, if we decide to add this feature. We can give the scap user permission to run systemctl restart php7.2-fpm, and have scap issue this command when the special flag (--force or --emergency or how we decide to call it) is used.

jijiki renamed this task from Add option in Scap to skip restarting php-fpm for emergency deployments to Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.Oct 22 2020, 5:43 PM

Proposal: Revise /usr/local/bin/safe-service-restart to accept --force to make it skip depool/repool.

We can give the scap user permission to run systemctl restart php7.2-fpm, and have scap issue this command […]

+1 I believe this is what Joe meant, and makes sense to me as well.

Change 635897 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[mediawiki/tools/scap@master] scap: Add support for ungraceful php-fpm restart during --force sync

https://gerrit.wikimedia.org/r/635897

Change 635630 abandoned by Ahmon Dancy:
[operations/puppet@production] Add --force flag to safe-service-restart.py

Reason:
superseded by https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/ /635897

https://gerrit.wikimedia.org/r/635630

Change 635635 abandoned by Ahmon Dancy:
[operations/puppet@production] Add --force flag to php-check-and-restart.sh

Reason:
superseded by https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/ /635897

https://gerrit.wikimedia.org/r/635635

Proposal:

Revise /usr/local/bin/safe-service-restart to accept --force to make it skip depool/repool.

Revise /usr/local/sbin/check-and-restart-php to accept --force and
pass it along to /usr/local/bin/safe-service-restart as needed.

scap: when scap sync or scap sync-world is passed --force, pass
--force to php_fpm_restart_script (aka /usr/local/sbin/check-and-restart-php).

+1 seems like the best UI change overall.

I also want to add (for additional safeguard) poolcounter support to the safe restart script.

Change 635630 restored by Giuseppe Lavagetto:
[operations/puppet@production] Add --force flag to safe-service-restart.py

https://gerrit.wikimedia.org/r/635630

Sorry, I see that given I didn't express my thoughts linearly, some confusion ensued.

To better clarify my line of thought, for making the process of restarting php-fpm for every deployment, and adding a brute force capability, we should do the following things:

  1. safe-service-restart should add two options:
    1. --force that would not try to depool and just go for the brutal restart, with no further checks.
    2. --max-concurrency to be implemented with poolcounter, allowing at most N safe-restarts to run for the same software at the same time.
  2. Scap should move to use restart-php7.2-fpm instead of check-and-restart-php, and add '--force' when running with --force itself.
  3. We need to reduce the pause pybal does between etcd reads.

1A is implemented in the patch by @dancy that I restored
1B is implemented as a series of changes I created as dependent on that change. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/635991. That list of patches still need some work and then testing.
2 I guess is up to @dancy (or whoever else!) to create the patches and release a new version of scap
3 Has a patch read, by yours truly, that needs to be merged and deployed

Once we have all those building blocks in place, we can think of how to test how this will work in production.

Relevant to this task are just points 1A and 2.

Addendum:
It could probably make sense to add a simple scap rolling-restart <cluster> <dc> command so that we can test everything works on the inactive datacenter, and while we're not releasing code. I guess this is requirement number 4 :P

I would also like for someone to investigate if systemctl reload php7.2-fpm clears opcache.

We have this configured to send a SIGUSR2 to the php-fpm leader process, which performs a graceful reload, allowing in-flight requests to complete.

Another thing we should configure is overriding systemd's KillSignal to be SIGQUIT instead of SIGTERM. Per docs, that's the signal that php-fpm accepts to ask for a graceful shutdown (whereas SIGTERM is an immediate one).

I would also like for someone to investigate if systemctl reload php7.2-fpm clears opcache.

IIRC it does, but we saw the same kind of corruption happening when issuing a reload than when invalidating it explicitly (that is, much more common than it is now with automatic revalidation).

We have this configured to send a SIGUSR2 to the php-fpm leader process, which performs a graceful reload, allowing in-flight requests to complete.

Another thing we should configure is overriding systemd's KillSignal to be SIGQUIT instead of SIGTERM. Per docs, that's the signal that php-fpm accepts to ask for a graceful shutdown (whereas SIGTERM is an immediate one).

I am ambivalent about it. Imagine there is a POST request hanging because of an outage, and the solution is to restart php-fpm. With SIGQUIT that would mean wait 200 seconds before the php daemon is available again. I prefer to keep a "nuclear" option - also a much faster one if we have to do a rolling restart - at the expense of a few failing requests. Most of those will be retried by the edge cache layer (or by envoy) anyways.

I would also like for someone to investigate if systemctl reload php7.2-fpm clears opcache.

I tested it, and it does.

We have this configured to send a SIGUSR2 to the php-fpm leader process, which performs a graceful reload, allowing in-flight requests to complete.

I will test that on Monday and see what happens.

Another thing we should configure is overriding systemd's KillSignal to be SIGQUIT instead of SIGTERM. Per docs, that's the signal that php-fpm accepts to ask for a graceful shutdown (whereas SIGTERM is an immediate one).

Why not, although we very rarely manually stop php-fpm. I can test that as well.

I would also like for someone to investigate if systemctl reload php7.2-fpm clears opcache.

IIRC it does, but we saw the same kind of corruption happening when issuing a reload than when invalidating it explicitly (that is, much more common than it is now with automatic revalidation).

I suggest we have one last go at it, and mark it as a complete failure if it causes corruptions.

We have this configured to send a SIGUSR2 to the php-fpm leader process, which performs a graceful reload, allowing in-flight requests to complete.

Another thing we should configure is overriding systemd's KillSignal to be SIGQUIT instead of SIGTERM. Per docs, that's the signal that php-fpm accepts to ask for a graceful shutdown (whereas SIGTERM is an immediate one).

I am ambivalent about it. Imagine there is a POST request hanging because of an outage, and the solution is to restart php-fpm. With SIGQUIT that would mean wait 200 seconds before the php daemon is available again. I prefer to keep a "nuclear" option - also a much faster one if we have to do a rolling restart - at the expense of a few failing requests. Most of those will be retried by the edge cache layer (or by envoy) anyways.

After discussing a bit more with chris, there is the process_control_timeout option is php-fpm.conf, to configure how much we will wait for a child to finish, which we may set a sensible number like 5s. Having a quick glance our p95s are in the range of 0.5-4s (some servers do peak at 8-9s ofc). Nevertheless, we can add yet another flag in the safe-service-restart for reloads.

Of course, none of this will matter if it is proved, once again, that reloads cause opcache corruptions :)

Change 636074 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/puppet@production] modules/scap/templates/scap.cfg.erb: Update php_fpm_restart_script

https://gerrit.wikimedia.org/r/636074

  1. safe-service-restart should add two options:
    1. --force that would not try to depool and just go for the brutal restart, with no further checks.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/635630

  1. --max-concurrency to be implemented with poolcounter, allowing at most N safe-restarts to run for the same software at the same time.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/635991

  1. Scap should move to use restart-php7.2-fpm instead of check-and-restart-php, and add '--force' when running with --force itself.

https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/635897
https://gerrit.wikimedia.org/r/c/operations/puppet/+/636074

  1. We need to reduce the pause pybal does between etcd reads.

https://gerrit.wikimedia.org/r/c/operations/debs/pybal/+/631686

Addendum:
It could probably make sense to add a simple scap rolling-restart <cluster> <dc> command so that we can test everything works on the inactive datacenter, and while we're not releasing code. I guess this is requirement number 4 :P

If you definitely want this happen, please file a new ticket and assign it to me.

Change 635897 merged by jenkins-bot:
[mediawiki/tools/scap@master] scap: Add support for ungraceful php-fpm restart

https://gerrit.wikimedia.org/r/635897

Change 635630 merged by Giuseppe Lavagetto:
[operations/puppet@production] Add --force flag to safe-service-restart.py

https://gerrit.wikimedia.org/r/635630

I note that the Scap changes for this are already in Scap and in the currently deployed Scap release, 3.16.0.

Change 636074 merged by Giuseppe Lavagetto:
[operations/puppet@production] modules/scap/templates/scap.cfg.erb: Define php_fpm_unsafe_restart_script

https://gerrit.wikimedia.org/r/636074