
Update Scap to perform a rolling restart for all MW deploys
Closed, ResolvedPublic

Description

The outcome of T253673 is to go with idea 3: rolling restarts.

Work:

  • [ ] Set opcache.validate_timestamps=0 in Beta Cluster app servers that do php-fpm restarts.
  • [ ] Set opcache.validate_timestamps=0 on all appserver clusters that do php-fpm restarts.
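
For reference, the setting above is a plain php.ini directive; a sketch of the config fragment (the file path is illustrative, not the actual puppet-managed location):

```ini
; /etc/php/7.2/fpm/conf.d/opcache.ini (illustrative path)
; With timestamp validation off, changed files are only picked up when
; php-fpm restarts, which is exactly what the rolling restart provides.
opcache.validate_timestamps=0
```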

Out of scope for this task:

  • Deal with long-running jobs and maintenance scripts.

Details

Subject | Repo | Branch | Lines +/-
… | operations/puppet | production | +19 -4
… | operations/puppet | production | +22 -18
… | mediawiki/tools/scap | master | +17 -0
… | mediawiki/tools/scap | master | +44 -56
… | operations/puppet | production | +9 -6
… | operations/puppet | production | +4 -0
… | operations/puppet | production | +3 -0
… | operations/puppet | production | +6 -4
… | mediawiki/tools/scap | master | +21 -13
… | mediawiki/tools/scap | master | +43 -225
… | mediawiki/tools/train-dev | master | +1 -1
… | operations/puppet | production | +6 -4
… | operations/puppet | production | +2 -0
… | operations/puppet | production | +1 -1
… | operations/puppet | production | +3 -2
… | operations/puppet | production | +1 -0
… | operations/puppet | production | +38 -0
… | operations/puppet | production | +10 -12
… | operations/puppet | production | +12 -0
… | operations/puppet | production | +64 -4
… | operations/puppet | production | +84 -2


Event Timeline

There are a very large number of changes, so older changes are hidden.

But anyways, scap should indeed report if any process returned a non-zero exit code, so we will know something went wrong.

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to etcd/PyBal failure), it always returns a non-zero exit code even if it ended up "succeeding" via an emergency restart. It sounds like @Joe is suggesting this?
  • Modify Scap to always display the stderr to deployers. Scap already treats "errors" (non-zero exit) as "warnings" (they are not counted toward a failure threshold, and the deploy continues regardless), so we probably don't need a separate signal to distinguish a non-zero exit from a success that wrote to stderr: both just end up showing a warning and continuing. We could therefore widen Scap's "warning" trigger to fire on stderr OR non-zero exit, instead of only on stderr as it does now. I think this is what @thcipriani and I were thinking, but only because we assumed the script would have to exit zero if the restart itself succeeded.
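
The widened trigger in the second option could be as small as this (a hypothetical sketch, not Scap's actual code; the function name is invented):

```python
def should_warn(exit_status: int, stderr_text: str) -> bool:
    """Hypothetical predicate for Scap's per-host warning: fire on a
    non-zero exit code OR any stderr output, instead of stderr only."""
    return exit_status != 0 or bool(stderr_text.strip())
```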

> Okay, so we can either:
>
>   • Modify the restart script so that when it degrades to an emergency restart (due to etcd/PyBal failure), it always returns a non-zero exit code even if it ended up "succeeding" with the actual restart. It sounds like @Joe is suggesting this?

This is the best path forward. I have confirmed that if the restart script on a host exits non-zero, scap will print a warning message (<command> on <host> returned [<exit status>]: <stdout and stderr text>) and move on.

@Joe Can we leave it to you to ensure that the restart script exits non-zero when it falls back to an emergency restart?

Not that I'm familiar with this code, but it looks like if poolcounter executes the error callback and the callback returns boolean False, the script exits with code 2: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/conftool/files/safe-service-restart.py#388

So whatever error callback we're specifying, it'll just need to return False.

@Joe is that roughly all that remains before we can try this?
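
The exit-code mapping described above could look roughly like this (a sketch under assumptions; all names are hypothetical stand-ins, not the real safe-service-restart.py API):

```python
def restart_via_poolcounter(do_restart, on_error):
    """Sketch of the behaviour described above: run the
    poolcounter-guarded action; on failure, call the error callback,
    and translate a False return from it into exit status 2."""
    try:
        do_restart()
        return 0
    except Exception:
        # Error path: a False return from the callback maps to exit 2.
        return 0 if on_error() else 2
```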

A data point, in case it's useful. I just promoted train to group1 and Scap told me

15:19:10 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1316.eqiad.wmnet returned [2]: NOT restarting php7.2-fpm: free opcache 312 MB
Fragmentation is at 31%, nothing to do here
Restarting php7.2-fpm: Number of cached keys 27868 is over the 27531 limit
2021-03-03 15:19:10,525 [ERROR] Error running command with poolcounter: timed out

But it didn't error out in the end.

> A data point, in case it's useful. I just promoted train to group1 and Scap told me
>
> 15:19:10 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1316.eqiad.wmnet returned [2]: NOT restarting php7.2-fpm: free opcache 312 MB
> Fragmentation is at 31%, nothing to do here
> Restarting php7.2-fpm: Number of cached keys 27868 is over the 27531 limit
> 2021-03-03 15:19:10,525 [ERROR] Error running command with poolcounter: timed out
>
> But it didn't error out in the end.

This seems like validation that:

> BUT(!) Now I see that the scap.Job abstraction that the RestartPHP class uses will report stderr to the deployer on non-zero exit code. If the script exits non-zero, the deployer will see the reason why.

Works :)

It doesn't look like there was an unsafe restart after failing to depool though.

Last restart of php-fpm on that machine was Wed 2021-03-03 16:27:21 (according to service php7.2-fpm status locally). Looking at logstash using "program:php7.2-fpm host:mw1316", it looks like that restart was the only restart that day -- unclear what triggered it (cron?)

@thcipriani @dancy we will try to get what is needed and get back to you, sorry for the delays

Change 682141 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] modules::conftool add safe-service-restart scap option

https://gerrit.wikimedia.org/r/682141

Looks like all that remains is to enable the feature-flag for a restart and see how long a deploy takes. @jijiki do you want to coordinate a specific time so you can watch closely, or should we just do it when we're sure people are around in case of explosion?

Tomorrow at 16:00 UTC (9am US/Pacific) @dancy and I will be testing post-deployment-rolling-php-fpm-restart.

We will test by running scap sync-file README and verifying that the rolling php-fpm restart runs smoothly. We will also be measuring the time it takes to complete the operation.

Though we expect success, we would appreciate support coverage from SRE while we run the test to help recover in case something goes terribly wrong. Thanks in advance!

We ran the test today. @jijiki supplied SRE backup.

It ran in two phases:

Phase 1: php_fpm_always_restart: false (its usual state at the moment)
scap sync-file README
Total time: 1m19.942s

Phase 2: php_fpm_always_restart: true
scap sync-file README
Total time: 3m12.836s

After the test it was noted in #wikimedia-operations that there was a small spike of HTTP 500 errors during phase 1 and a larger spike during phase 2. A discussion ensued and it was concluded that this is expected, unavoidable, and an acceptable outcome.

Change 719307 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] ::profile::mediawiki::common.pp: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force

https://gerrit.wikimedia.org/r/719307

Change 719307 merged by Giuseppe Lavagetto:

[operations/puppet@production] Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force

https://gerrit.wikimedia.org/r/719307

> We ran the test today. @jijiki supplied SRE backup.
>
> It ran in two phases:
>
> Phase 1: php_fpm_always_restart: false (its usual state at the moment)
> scap sync-file README
> Total time: 1m19.942s
>
> Phase 2: php_fpm_always_restart: true
> scap sync-file README
> Total time: 3m12.836s
>
> After the test it was noted in #wikimedia-operations that there was a small spike of HTTP 500 errors during phase 1 and a larger spike during phase 2. A discussion ensued and it was concluded that this is expected, unavoidable, and an acceptable outcome.

It is a surprise to me that even without restarts there would be an increase in "normal" HTTP 500 errors during a deploy. I'm not aware of this being the status quo, nor of what would cause it.

A slight increase in errors with restarts seems expected in principle, given that we can't ensure syncs complete within a given time frame (e.g. 3 minutes) in a way that both restarts all servers to serve new code and avoids killing requests that take longer than 1-2 minutes to complete. So some amount of request cancellation seems expected, especially for long-running jobs and maintenance scripts.

But the amount of cancellation we saw, and the explanation provided by SRE, suggests that we are cancelling all on-going requests immediately when we start the depool/restart sequence for a particular server. This seems surprising to me, both because it seems easy to avoid and has fairly high end-user impact, and because conceptually the default sequence feels like it should be a graceful restart, in contrast to --force, which is not enabled by default.

My understanding, roughly, is as follows, for the default "graceful" restart:

  • await depooling the server.
  • await a threshold for in-flight requests (e.g. 1 second, which is above the 95th percentile of appserver latency).
  • restart php-fpm.
  • repool.
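
The four steps above, as a sketch (the callables are hypothetical stand-ins for the real conftool/systemd operations):

```python
import time

def graceful_restart(server, depool, restart, repool, grace_period=1.0):
    # 1. Take the server out of rotation and wait for that to take effect.
    depool(server)
    # 2. Drain in-flight requests (e.g. ~1s, above the p95 appserver latency).
    time.sleep(grace_period)
    # 3. Restart php-fpm while no new traffic is arriving.
    restart(server)
    # 4. Put the server back in rotation.
    repool(server)
```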

Also, given the use of poolcounter for the rolling restart and how the internal Scap code is organised, I'm assuming the code is aware of each individual server finishing the sequence, not just of when a batch of N servers finishes. If it doesn't already, it seems like it could roll in a way that resembles a round wheel rather than kicking a square one along: don't wait for outliers in the PyBal or php-fpm commands before starting the next host once a slot frees up. That should speed things up overall as well.

Having said that, the sync time is AFAIK not a blocker, since it is already quite low (3 min), which we previously accepted as fast enough.

I mentioned this to @jijiki a few weeks ago, and she agreed a 1s threshold makes sense and should cover the majority of this.

One other thing that comes to mind is what we do with jobrunners and maintenance hosts.

jobrunners: I'm guessing these are included since they're straightforward php-fpm sync targets from Scap's perspective (despite being more CLI-like in how we use them). Killing in-flight jobs on every minor sync and deployment should be fine in principle, but it will certainly stress-test how well everything is put together. It also highlights the technical debt we have where some jobs are still not retryable; that is fairly obscure today, with jobs almost never being killed randomly, but it would become a lot more common if we start doing this. I don't have a good sense right now of the risk and likelihood of data loss (e.g. undelivered emails/notifications, outdated categories/search, pages not deleted). We could wing it and see what happens. Or we could take the opportunity to audit job definitions, identify which ones don't support retries, inform their owners, and ask them to decide whether and how they want to address it and whether it is urgent enough to block.

maintenance hosts: I'm guessing these are not affected by current restart logic, unless someone wrote code for killing on-going PHP CLI processes that are not managed by php-fpm. I'm guessing this is fine and we don't need to kill those?

> Also, given the use of poolcounter for the rolling restart and how the internal Scap code is organised, I'm assuming the code is aware of each individual server finishing the sequence, not just of when a batch of N servers finishes. If it doesn't already, it seems like it could roll in a way that resembles a round wheel rather than kicking a square one along: don't wait for outliers in the PyBal or php-fpm commands before starting the next host once a slot frees up. That should speed things up overall as well.

I can confirm that scap just limits the maximum number of concurrent tasks. It does not perform tasks in discrete batches.
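
That sliding-window behaviour can be illustrated like this (an illustration of the concept, not Scap's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def rolling_run(hosts, task, max_concurrency=3):
    """Run `task` on every host with at most `max_concurrency` in
    flight at once. A new host starts as soon as any slot frees up;
    there are no discrete batches that must fully finish first."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() preserves input order while keeping the window full.
        return list(pool.map(task, hosts))
```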

The test command for the record: scap sync-file -D php_fpm_always_restart:true README

> We ran the test today. @jijiki supplied SRE backup.
>
> It ran in two phases:
>
> Phase 1: php_fpm_always_restart: false (its usual state at the moment)
> scap sync-file README
> Total time: 1m19.942s
>
> Phase 2: php_fpm_always_restart: true
> scap sync-file README
> Total time: 3m12.836s
>
> After the test it was noted in #wikimedia-operations that there was a small spike of HTTP 500 errors during phase 1 and a larger spike during phase 2. A discussion ensued and it was concluded that this is expected, unavoidable, and an acceptable outcome.

> It is a surprise to me that even without restarts there would be an increase in "normal" HTTP 500 errors during a deploy. I'm not aware of this being the status quo, nor of what would cause it.

Depending on the nature of the deployment, yes, it's normal to get a few 5xx errors - usually reported by the appservers, but they won't reach the user unless it's an edit.

> A slight increase in errors with restarts seems expected in principle, given that we can't ensure syncs complete within a given time frame (e.g. 3 minutes) in a way that both restarts all servers to serve new code and avoids killing requests that take longer than 1-2 minutes to complete. So some amount of request cancellation seems expected, especially for long-running jobs and maintenance scripts.

For maintenance scripts, no. But this could indeed prove problematic for videoscaling of large files. It's a problem we have to solve for k8s anyway.

> But the amount of cancellation we saw, and the explanation provided by SRE, suggests that we are cancelling all on-going requests immediately when we start the depool/restart sequence for a particular server.

I just found the problem, which frankly is baffling to me: apparently the script we're using to restart from scap doesn't declare --grace-period as an argument of safe-service-restart, which is what determines the number of seconds we wait before restarting php-fpm. I am just going to change the default for safe-service-restart for now, because leaving it at zero seconds almost never seems like a good idea to me.
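
Declaring the missing flag with a safer default could look like this (a hypothetical sketch; the real code lives in modules/conftool/files/safe-service-restart.py):

```python
import argparse

def build_parser():
    # Sketch of exposing --grace-period with a non-zero default, so
    # callers wait before restarting php-fpm instead of silently
    # getting zero seconds. Names mirror the description above only.
    parser = argparse.ArgumentParser(prog="safe-service-restart")
    parser.add_argument("--grace-period", type=int, default=3,
                        help="seconds to wait after depooling "
                             "before restarting the service")
    return parser
```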

Change 752600 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] safe-service-restart: make the default grace period 3 seconds

https://gerrit.wikimedia.org/r/752600

Change 752600 merged by Giuseppe Lavagetto:

[operations/puppet@production] safe-service-restart: make the default grace period 3 seconds

https://gerrit.wikimedia.org/r/752600

Retest after the 3 second grace period commit was merged:

dancy@deploy1002:/srv/mediawiki-staging$ scap sync-file -D php_fpm_always_restart:true README "Testing php-fpm restart"
....
16:13:10 Started sync-apaches
sync-apaches: 100% (ok: 347; fail: 0; left: 0)
16:13:19 Finished sync-apaches (duration: 00m 08s)
16:13:19 Started php-fpm-restarts
16:13:19 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 352 host(s)
16:15:45 Finished php-fpm-restarts (duration: 02m 25s)
16:15:45 Synchronized README: Testing php-fpm restart (duration: 03m 18s)

@Joe monitored https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=63&orgId=1&from=1642175067857&to=1642179176551 during the operation. The peak 503 error rate during the restart phase was 0.433 per second.

dancy updated the task description.

Hi @Joe . Can you update the description of this report with the remaining areas of work?

Change 792980 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: do not restart jobrunners on deployment

https://gerrit.wikimedia.org/r/792980

Change 792981 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: enable restarting php-fpm on deployment

https://gerrit.wikimedia.org/r/792981

Change 792982 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki::php: check opcache revalidation in restart script

https://gerrit.wikimedia.org/r/792982

Change 792983 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki_canaries: disable opcache revalidation

https://gerrit.wikimedia.org/r/792983

Change 792984 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki: disable revalidation everywhere

https://gerrit.wikimedia.org/r/792984

Change 792980 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: do not restart jobrunners on deployment

https://gerrit.wikimedia.org/r/792980

Change 792981 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: enable restarting php-fpm on deployment

https://gerrit.wikimedia.org/r/792981

Mentioned in SAL (#wikimedia-operations) [2022-05-30T08:42:41Z] <oblivian@deploy1002> Synchronized README: testing php restarts with scap, T266055 (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2022-05-30T09:53:34Z] <oblivian@deploy1002> Synchronized README: testing php restarts with scap, T266055 (duration: 03m 30s)

Change 792983 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki_canaries: disable opcache revalidation

https://gerrit.wikimedia.org/r/792983

Current status is:

  • Each deployment will restart php-fpm
  • api and appserver canaries have opcache revalidation turned off
  • Tomorrow we'll turn it off for all appservers as well.

This amongst other things means that until we do the restart, the new code will mostly not be available, making deployments slightly more atomic.

Change 801773 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/train-dev@master] Set php_fpm_always_restart to True in deploy:/etc/scap.cfg

https://gerrit.wikimedia.org/r/801773

Change 801773 merged by jenkins-bot:

[mediawiki/tools/train-dev@master] Set php_fpm_always_restart to True in deploy:/etc/scap.cfg

https://gerrit.wikimedia.org/r/801773

Change 801767 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/tools/scap@master] Restart canaries before checking, remove the opcache invalidation function

https://gerrit.wikimedia.org/r/801767

Change 801767 merged by jenkins-bot:

[mediawiki/tools/scap@master] Restart canaries before checking, remove the opcache invalidation function

https://gerrit.wikimedia.org/r/801767

Change 801807 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Don't attempt php fpm restart if not configured

https://gerrit.wikimedia.org/r/801807

Change 801807 merged by jenkins-bot:

[mediawiki/tools/scap@master] Don't attempt php fpm restart if not configured

https://gerrit.wikimedia.org/r/801807

> Current status is:
>
>   • Each deployment will restart php-fpm
>   • api and appserver canaries have opcache revalidation turned off
>   • Tomorrow we'll turn it off for all appservers as well.
>
> This amongst other things means that until we do the restart, the new code will mostly not be available, making deployments slightly more atomic.

This did not exactly happen / was reverted because of a missing patch in scap. Now that scap 4.8.2 is rolled out (T309116), I will turn off opcache revalidation on the api/appserver canaries now, and potentially do all appservers tomorrow if everything looks fine.

Change 802134 had a related patch set uploaded (by JMeybohm; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki: stop revalidating opcache on canaries

https://gerrit.wikimedia.org/r/802134

Change 802134 merged by JMeybohm:

[operations/puppet@production] mediawiki: stop revalidating opcache on canaries

https://gerrit.wikimedia.org/r/802134

Mentioned in SAL (#wikimedia-operations) [2022-06-07T11:04:47Z] <jayme> restarting php-fpm on api and appserver canaries - T266055

Change 803955 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

Change 803955 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

Change 804441 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

Change 804441 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

Hey @dancy, do you think we're good to proceed with disabling revalidation for all of api,app,parsoid ?

> Hey @dancy, do you think we're good to proceed with disabling revalidation for all of api, app, parsoid?

Yes. Having opcache.validate_timestamps=0 on the canaries doesn't seem to have resulted in any problems.

Added to task description, and ticked off the first:

  • [x] Set opcache.validate_timestamps=0 on prod canaries.
  • [ ] Set opcache.validate_timestamps=0 in Beta Cluster app servers that do php-fpm restarts.
  • [ ] Set opcache.validate_timestamps=0 on all appserver clusters that do php-fpm restarts.

I've moved the item about jobrunner to "Out of scope" to reflect the above decision. Changing jobrunner does not block the parent task.

Change 792984 merged by JMeybohm:

[operations/puppet@production] mediawiki: disable revalidation for api,app,parsoid clusters

https://gerrit.wikimedia.org/r/792984

Mentioned in SAL (#wikimedia-operations) [2022-06-23T10:25:18Z] <jayme> running restart-php7.2-fpm A:parsoid or A:mw or A:mw-api to disable opcache revalidation - T266055

The ones left in prod are: mwdebug[2001-2002].codfw.wmnet, mwdebug[1001-1002].eqiad.wmnet

Change 811373 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Clean up php fpm restart

https://gerrit.wikimedia.org/r/811373

Change 811373 merged by jenkins-bot:

[mediawiki/tools/scap@master] Clean up php fpm restart

https://gerrit.wikimedia.org/r/811373

Joe updated the task description.

Change 825873 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Add "scap php-fpm-restart"

https://gerrit.wikimedia.org/r/825873

Change 825873 merged by jenkins-bot:

[mediawiki/tools/scap@master] Add "scap php-fpm-restart"

https://gerrit.wikimedia.org/r/825873

Change 682141 abandoned by Effie Mouzeli:

[operations/puppet@production] modules::conftool add safe-service-restart scap option

Reason:

not relevant any more

https://gerrit.wikimedia.org/r/682141