
Update Scap to perform a rolling restart for all MW deploys
Open, Medium, Public

Description

The outcome of T253673 is to go with idea 3 - rolling restarts.

Work:

Event Timeline

  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long tail of issues that we've been playing whack-a-mole with -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then waiting 15 minutes for it to be fixed is problematic (cf: T244544)

@Joe said today that a rolling restart takes 5-10 minutes, not 15 minutes. […]

RE: Having an option to take a shortcut, […] not restarting equates to effectively not having deployed any code.

However, if we want to have a shortcut that doesn't do things in a rolling way, but immediately […] e.g. if most requests are HTTP 5xx anyway, and if not having appserver capacity is handled gracefully higher in the traffic stack, then maybe an option for a bigger, more risky batch could make sense. […]

I checked with SRE, and it looks like we should not re-purpose restart-php7.2-fpm and safe-service-restart.py here. […]

Open questions (I guess for @Joe):

  • Would "simply" restarting php-fpm directly abort on-going requests? I assume so, and we're fine with this. Just checking so that we can document this as expected.

Yes, restarting php-fpm will kill all in-flight requests.

  • What would happen to requests routed to an appserver during the restart? Fail fast? If not, is there something at hand that you'd like to configure differently to make such reqs fail fast?

Yes, it would fail fast, and unless it's a POST, it will be retried by ATS.

  • Is it worth putting some kind of very crude batching or rolling into this forced/emergency restart mode? Or is fanning out everywhere as fast as we can fine?

I think the same batching we maintain for the non-force case should be fine.

The only difference is that in that case we'd just have to run a different command, which might need some permissions work.

thcipriani added subscribers: dancy, LarsWirzenius.

Assigning per brief discussion in IRC. @dancy feel free to loop in others (@LarsWirzenius will be helpful for actual release) as needed.

thcipriani triaged this task as Medium priority.Oct 20 2020, 6:07 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.

I added what I think is an outline of the work we still need to do in order to make this process safe and efficient in T243009#6574045

Change 635991 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] safe-service-restart: add optional poolcounter support

https://gerrit.wikimedia.org/r/635991

Change 635992 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] poolcounter: add client configuration classes

https://gerrit.wikimedia.org/r/635992

Change 635993 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::lvs::realserver: add ability to configure poolcounter for pools

https://gerrit.wikimedia.org/r/635993

Change 635994 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] restbase: add poolcounter support to safe-service-restart scripts

https://gerrit.wikimedia.org/r/635994

Change 636074 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/puppet@production] modules/scap/templates/scap.cfg.erb: Update php_fpm_restart_script

https://gerrit.wikimedia.org/r/636074

Change 635991 merged by Giuseppe Lavagetto:
[operations/puppet@production] safe-service-restart: add optional poolcounter support

https://gerrit.wikimedia.org/r/635991

Change 635992 merged by Giuseppe Lavagetto:
[operations/puppet@production] poolcounter: add client configuration classes

https://gerrit.wikimedia.org/r/635992

Change 635993 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::lvs::realserver: add ability to configure poolcounter for pools

https://gerrit.wikimedia.org/r/635993

Change 635994 merged by Giuseppe Lavagetto:
[operations/puppet@production] restbase: add poolcounter support to safe-service-restart scripts

https://gerrit.wikimedia.org/r/635994

Change 640928 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::lvs::realserver: use poolcounter for guarding service restarts

https://gerrit.wikimedia.org/r/640928

Change 640928 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::lvs::realserver: use poolcounter for guarding service restarts

https://gerrit.wikimedia.org/r/640928

https://gerrit.wikimedia.org/r/c/operations/puppet/+/636074 needs review and merge before this can be wrapped up.

Confirmed with @effie that this will roll out early this week, thus resolving sub task T243009.

I guess next is to prepare for actually using the new unconditional php-fpm rolling restart during regular Scap syncs.
What is left to do here? And do we want to coordinate with SRE the first time such a sync is done in production?

Krinkle updated the task description.

The feature flag has landed meanwhile. Highlights from T264362:

In T264362#6511242, @Joe wrote on 2 Oct 2020:

For the record, I just performed a rolling restart, using the same command as scap does, in eqiad for the appserver cluster - the largest we have. This took 2 minutes 14 seconds.

With a couple of changes on the load balancers, that can be sped up significantly - currently pybal waits a grace time of 1 second […] which means that right now the lower limit for the rolling restart would be between 5 and 10 minutes […]. We can halve that time by just reducing that sleep period on the load balancers.

In T264362#6602993, @gerritbot wrote on 5 Nov 2020:

Change 631686 merged by jenkins-bot:
[operations/debs/pybal@1.15] Reduce reconnectTimeout for etcd to 0.1 seconds

https://gerrit.wikimedia.org/r/631686

In T264362#6586586, @gerritbot wrote on 28 Oct 2020:

Change 635897 merged by jenkins-bot:
[mediawiki/tools/scap@master] scap: Add support for ungraceful php-fpm restart

https://gerrit.wikimedia.org/r/635897

So a full restart took ~2 minutes (and that was before the reconnectTimeout reduction), which is pretty good - not the 15 minutes we feared in T253673.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/636074 needs review and merge before this can be wrapped up.

Confirmed with @effie that this will roll out early this week, thus resolving sub task T243009.

I guess next is to prepare for actually using the new unconditional php-fpm rolling restart during regular Scap syncs.
What is left to do here? And do we want to coordinate with SRE the first time such a sync is done in production?

Yes please. I'd like someone from serviceops and someone from Release-Engineering-Team to be on hand for the first one of these. To either collaboratively celebrate or collaboratively troubleshoot.

Update from @Joe: The above pybal change has been merged but not yet released. It will need to be released first to ensure that rolling restarts not only can take 2 minutes, but will in fact consistently take < 2 minutes, by not letting the current, more liberal timeouts stack up.

Mentioned in SAL (#wikimedia-operations) [2021-01-28T10:33:53Z] <_joe_> performing a test-run of the rolling restart of php-fpm in codfw, using the same code scap will use T266055. Starting from the api cluster, then proceeding with others

I have updated pybal in codfw this morning to do a first test run of how long the rolling restart will take.

First results are, for a *sequential* run:

API: ~2 minutes 20 s
jobrunner: ~3 minutes
appserver: ~2 minutes 20 s

The results are only partially puzzling (the jobrunners are just 18 servers, but we only restart servers in batches of 2); these times were pretty much in line with what I expected.

After doing that, I proceeded to launch the commands on the three pools at the same time, suspecting this could put more strain on pybal. Sadly, it resulted in the whole process taking about 5 minutes, which I guess is not great but manageable (and btw, once we move to kubernetes, deployment times might very well exceed 5 minutes).

I'll do one more test, this time in eqiad, where I suspect the times might be slightly longer, but not much, mostly to check if we emit a lot of 5xx errors because of the rolling restart, and to check if there is any impact on performance.

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

Mentioned in SAL (#wikimedia-operations) [2021-01-28T11:01:10Z] <_joe_> restarting php-fpm on the appserver,api and jobrunner clusters in eqiad, 10% at a time, for simulating scap rolling restarts T266055

The rolling restart in eqiad happened before the fix to etcd was introduced, which caused some inconsistency in the "safety" of the restart, thus reducing the total runtimes.

But interestingly, while this was in good part a "forced" restart, with the depooling/repooling often happening after the restart was done, it caused just a small uptick in the number of errors generated internally, to about 3 errors/s due to unreachability of the backend (out of roughly 30k req/s). This is because php-fpm restarts blazingly fast, more or less.
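(For scale, a quick back-of-envelope: 3 errors/s against ~30,000 req/s is on the order of 0.01% of internal requests.)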

Also: user-facing impact was barely noticeable (the number of errors at the traffic layer is small, because of our retry policies): https://logstash.wikimedia.org/goto/07fb97ce7cee3af5eb2d16851175d727

Performance was largely unaffected on the backend as well - actually all latency percentiles show no significant change (see p99 below as an example)

All in all, I feel confident that restart times will stay at or below 5 minutes (we have some knobs we can tweak), and that even when we have to force a deployment the effect won't be intolerable.

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

This part of Scap is not well known to me, so I'll write here what I understand so that others can fact-check it. Scap gets the name of the script it runs to restart php-fpm from Scap's configuration; see the file scap/php_fpm.py and the PHPRestart class. That ultimately calls utils.sudo_check_call, which raises subprocess.CalledProcessError if the script fails. The call to sudo_check_call seems to mainly happen via AbstractSync, a base class for all sync-related Scap commands, and that uses canaries.

As best I can decipher the logic, the normal canary rules apply: if too many fail, the whole operation fails. All failures are logged in any case.

@thcipriani can you confirm?

The PHPRestart class does call sudo_check_call, but it does so in a try/except (probably because @Joe wrote that class :)): https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/php_fpm.py#85

After re-reading (but not testing just yet): if there are failures in that restart it will log a warning: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/main.py#522

But otherwise the process continues.
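To make that warn-and-continue shape concrete, here is a minimal sketch (not the actual scap source; restart_php_on()/restart_php() and the ssh invocation are made-up stand-ins) of the behaviour described above: a failing restart script is caught, logged as a warning, and the sync carries on.

```python
# Hedged sketch of the warn-and-continue pattern described above.
# Not scap's real code; names and the ssh call are illustrative only.
import logging
import subprocess

logger = logging.getLogger("scap")

def restart_php_on(host, restart_cmd):
    """Run the restart script on one host; warn on failure, never raise."""
    try:
        # scap's real code path goes through utils.sudo_check_call();
        # subprocess.check_call stands in for it in this sketch.
        subprocess.check_call(["ssh", host, restart_cmd])
        return True
    except subprocess.CalledProcessError as exc:
        logger.warning("%s on %s returned [%d]", restart_cmd, host, exc.returncode)
        return False

def restart_php(hosts, restart_cmd):
    failed = [h for h in hosts if not restart_php_on(h, restart_cmd)]
    if failed:
        # Mirrors the "had problems restarting php" warning; the sync keeps going.
        logger.warning("%d hosts had problems restarting php", len(failed))
```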

After doing that, I proceeded to launch the commands on the three pools at the same time, suspecting this could put more strain on pybal. Sadly, it resulted in the whole process taking about 5 minutes, which I guess is not great but manageable (and btw, once we move to kubernetes, deployment times might very well exceed 5 minutes).

+1 -- 5 minutes for added safety is entirely tolerable.

The PHPRestart class does call sudo_check_call, but it does so in a try/except (probably because @Joe wrote that class :)): https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/php_fpm.py#85

After re-reading (but not testing just yet): if there are failures in that restart it will log a warning: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/main.py#522

But otherwise the process continues.

I think that is a sensible approach from the point of view of deployment. But it means the script we call from here needs to change a bit. Specifically, it should just make sure the server is pooled at the end of the process, no matter what. The logic of the safe restart we have now is that, upon failure, execution stops and nothing else gets done.

That wouldn't work with a deployment, where the logic should be:

  • if server is not pooled, just restart (ok)
  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

We should not care too much about the pooled status after the restart, as we don't really need to wait for it IMHO.

I will try to create a more specific option for the restart script that will adopt the abovementioned logic.
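A minimal sketch of that logic, assuming hypothetical helpers is_pooled(), set_pooled() and restart_php_fpm() (this is not safe-service-restart.py, just the shape of the option described above):

```python
# Minimal sketch of the deploy-restart logic described above, under the
# assumption of hypothetical helper callables; not safe-service-restart.py.
import logging

logger = logging.getLogger("deploy-restart")

def deploy_restart(host, is_pooled, set_pooled, restart_php_fpm):
    """Restart php-fpm, preferring a clean depool but never blocking the deploy."""
    degraded = False
    try:
        if is_pooled(host):
            try:
                set_pooled(host, False)          # best-effort depool via etcd
            except Exception:
                # etcd/pybal trouble: restart anyway, while still pooled.
                logger.warning("%s: depool failed, restarting while pooled", host)
                degraded = True
        restart_php_fpm(host)
    finally:
        # Always record the desired state in etcd as "pooled"; we don't
        # wait for the repool to actually propagate.
        try:
            set_pooled(host, True)
        except Exception:
            logger.error("%s: could not record pooled=yes in etcd", host)
            degraded = True
    # Exit-code-style signal back to scap: non-zero makes it print a warning.
    return 2 if degraded else 0
```

The key property is the finally block: whatever happens during the restart, the desired etcd state ends up as "pooled", and a degraded run signals itself with a non-zero-style return value.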

One of the things I'm hopeful about after implementing the restart is that by restarting we remove the sync-order problem (after we verify the restart works, we can remove the opcache mtime polling and just always restart). This means backports suddenly become much safer and require less experience and training, since the order in which things land on the servers is no longer a problem \o/

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an extra safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing a warning in a way that propagates to the Scap operator, so that they know to look for possible impact, especially if a lot of servers unexpectedly got an emergency restart. I believe we already have a warning mechanism for this, where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying that mw1234 had raised error levels, FYI.

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an extra safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

This is a good question. I had assumed that depooling was necessary because php-fpm doesn't handle this gracefully, but I'd be interested in the answer here as well.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing a warning in a way that propagates to the Scap operator, so that they know to look for possible impact, especially if a lot of servers unexpectedly got an emergency restart. I believe we already have a warning mechanism for this, where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying that mw1234 had raised error levels, FYI.

Issuing a warning to the scap operator should be what currently happens if a machine's run of the check-opcache-limit-and-restart-if-needed script exits non-zero. Differentiating a failed restart from a failed-depool + successful-restart might be tricky on the scap side(?)

Differentiating a failed restart […]

Maybe it can (or already does) pass through standard error?

It likely could; currently it just warns "X host had problems restarting php": https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/d441da54a48e6fdad03567864b6027f78764a7bf/scap/main.py#517

EDIT:
BUT(!) Now I see that the scap.Job abstraction that the RestartPHP class uses will report stderr to the deployer on non-zero exit code. If the script exits non-zero, the deployer will see the reason why.

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an extra safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

The latter, but to clarify: this fallback should only happen in the eventuality that etcd doesn't work or - worse - that pybal is lagging so much it can't keep up with pooling/depooling.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing a warning in a way that propagates to the Scap operator, so that they know to look for possible impact, especially if a lot of servers unexpectedly got an emergency restart. I believe we already have a warning mechanism for this, where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying that mw1234 had raised error levels, FYI.

Unless there is some (unlikely) catastrophic failure somewhere else in the stack, this wouldn't happen. I just want to have a safety measure against events like the above. But anyways, scap should indeed report if any process returned a non-zero exit code, so we will know something went wrong.

But anyways, scap should indeed report if any process returned a non-zero exit code, so we will know something went wrong.

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to an etcd/pybal failure), it always returns a non-zero exit code, even if it ended up "succeeding" via an emergency restart. It sounds like @Joe is suggesting this? (A sketch of this option follows after the list.)
  • Modify Scap to always display the stderr to deployers. Given that it already treats "errors" (non-zero exit) as "warnings" (they are not counted toward a threshold, and it continues regardless), we probably don't need a separate signal in Scap to distinguish a non-zero exit from a success that wrote to stderr; either way it ends up showing a warning and continuing. So I suppose we could widen Scap's "warning" trigger to fire on stderr OR a non-zero exit, instead of only on stderr as it does now. I think this is what @thcipriani and I were thinking, but only because we assumed the script would have to exit zero (success) if the restart itself succeeded.
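For illustration, the first option could look like this at the tail of the restart script (a hypothetical sketch; run_restart() is a made-up stand-in for the actual depool/restart/repool work, not the real script):

```python
# Sketch of option 1 (hypothetical, not the real restart script): exit
# non-zero whenever we had to fall back to an emergency restart, even if
# php-fpm itself came back up fine, so that scap prints its warning line.
import sys

def run_restart() -> bool:
    """Stub standing in for the depool/restart/repool work; returns True
    when the script had to degrade to an emergency (still-pooled) restart."""
    return False

def main() -> int:
    degraded = run_restart()
    if degraded:
        print("restarted while pooled (etcd/pybal unavailable)", file=sys.stderr)
        return 2   # non-zero exit => scap warns the deployer and moves on
    return 0

if __name__ == "__main__":
    sys.exit(main())
```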

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to an etcd/pybal failure), it always returns a non-zero exit code, even if it ended up "succeeding" with the actual restart. It sounds like @Joe is suggesting this?

This is the best path forward. I have confirmed that if the restart script on a host exits non-zero, scap will print a warning message (<command> on <host> returned [<exit status>]: <stdout and stderr text>) and move on.

@Joe Can we leave it to you to ensure that the restart script exits non-zero when it falls back to an emergency restart?

Not that I'm familiar with this code, but it looks like if poolcounter executes the error callback and that callback returns boolean False, the script will exit with code 2: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/conftool/files/safe-service-restart.py#388

So whatever error callback we're specifying, it just needs to return False.

@Joe is that roughly all that remains before we can try this?
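If I read that right, the callback shape would be something like the following (a hedged sketch; on_poolcounter_error is a made-up name, not the real safe-service-restart.py API):

```python
# Hypothetical sketch only: the error-callback shape being discussed above.
import sys

def on_poolcounter_error(exc):
    """Hypothetical error callback, invoked when poolcounter fails (e.g. a timeout)."""
    print("Error running command with poolcounter: %s" % exc, file=sys.stderr)
    # Returning False is what, per the script linked above, gets translated
    # into an exit code of 2 instead of proceeding as if the lock were held.
    return False
```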

A data point, in case it's useful. I just promoted the train to group1 and Scap told me:

15:19:10 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1316.eqiad.wmnet returned [2]: NOT restarting php7.2-fpm: free opcache 312 MB
Fragmentation is at 31%, nothing to do here
Restarting php7.2-fpm: Number of cached keys 27868 is over the 27531 limit
2021-03-03 15:19:10,525 [ERROR] Error running command with poolcounter: timed out

But it didn't error out in the end.

This seems like validation that:

BUT(!) Now I see that the scap.Job abstraction that the RestartPHP class uses will report stderr to the deployer on non-zero exit code. If the script exits non-zero, the deployer will see the reason why.

Works :)

It doesn't look like there was an unsafe restart after failing to depool though.

Last restart of php-fpm on that machine was Wed 2021-03-03 16:27:21 (according to service php7.3-fpm status locally). Looking at logstash using "program:php7.2-fpm host:mw1316" it looks like that restart was the only restart that day -- unclear what triggered it (cron?)

@thcipriani @dancy we will try to get what is needed and get back to you, sorry for the delays

Change 682141 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] modules::conftool add safe-service-restart scap option

https://gerrit.wikimedia.org/r/682141

Looks like all that remains is to enable the feature-flag for a restart and see how long a deploy takes. @jijiki do you want to coordinate a specific time so you can watch closely, or should we just do it when we're sure people are around in case of explosion?