
Update Scap to perform rolling restart for all MW deploy
Open, Medium, Public

Description

The outcome of T253673 is to go with idea 3: rolling restarts.

Work:

  • Patch Scap to do the rolling restart always (instead of only every once in a while, currently based on unreliable opcache thresholds) -- https://gerrit.wikimedia.org/r/631776
  • Patch Scap to implement an emergency flag that will perform this restart in a way that does not ensure live server capacity, in case of a bad patch having taken down the site in large part. Tracked by T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers
  • Deploy updated Scap to Beta Cluster -- https://integration.wikimedia.org/ci/job/scap-beta-deb/118/
  • Package Scap for production aptitude.
  • Deploy updated Scap to production.
  • Measure timing before and after, and report on task.
    • Test results:
    • With php_fpm_always_restart: false, scap sync-file README takes 1m19.942s
    • With php_fpm_always_restart: true, scap sync-file README takes 3m12.836s
  • Set opcache.validate_timestamps=0 on prod canaries.
  • Set opcache.validate_timestamps=0 in Beta Cluster app servers that do php-fpm restarts.
  • Set opcache.validate_timestamps=0 on all appserver clusters that do php-fpm restarts.

Out of scope for this task:

  • Deal with long-running jobs and maintenance scripts.

Details

Project                   | Branch     | Lines +/-
operations/puppet         | production | +9 -6
operations/puppet         | production | +4 -0
operations/puppet         | production | +3 -0
operations/puppet         | production | +6 -4
mediawiki/tools/scap      | master     | +21 -13
mediawiki/tools/scap      | master     | +43 -225
mediawiki/tools/train-dev | master     | +1 -1
operations/puppet         | production | +6 -4
operations/puppet         | production | +22 -18
operations/puppet         | production | +2 -0
operations/puppet         | production | +1 -1
operations/puppet         | production | +3 -2
operations/puppet         | production | +1 -0
operations/puppet         | production | +19 -4
operations/puppet         | production | +38 -0
operations/puppet         | production | +10 -12
operations/puppet         | production | +12 -0
operations/puppet         | production | +64 -4
operations/puppet         | production | +84 -2

Related Objects

Event Timeline


Update from @Joe: The above PyBal change has been merged but not yet released. It will need to be released first, to ensure that rolling restarts not only can take 2 minutes, but consistently will take under 2 minutes, by not letting the current, more liberal timeouts stack up.

Mentioned in SAL (#wikimedia-operations) [2021-01-28T10:33:53Z] <_joe_> performing a test-run of the rolling restart of php-fpm in codfw, using the same code scap will use T266055. Starting from the api cluster, then proceeding with others

I have updated PyBal in codfw this morning to do the first test run and see how long the rolling restart will take.

First results are, for a *sequential* run:

API: ~ 2 minutes 20 s
Jobrunner: ~ 3 minutes
Appserver: ~ 2 minutes 20 s

The results are only partly puzzling (the jobrunners are just 18 servers), given that we only restart servers in batches of 2. These times were pretty much in line with what I expected.

After doing that, I proceeded to launch the commands on the three pools at the same time, suspecting this could put some more strain on PyBal. Sadly it resulted in the whole process taking about 5 minutes, which I guess is not great but manageable (and, by the way, once we move to Kubernetes, deployment times might very well exceed 5 minutes).

I'll do one more test, this time in eqiad, where I suspect the times might be slightly longer, but not much, mostly to check if we emit a lot of 5xx errors because of the rolling restart, and to check if there is any impact on performance.

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

Mentioned in SAL (#wikimedia-operations) [2021-01-28T11:01:10Z] <_joe_> restarting php-fpm on the appserver,api and jobrunner clusters in eqiad, 10% at a time, for simulating scap rolling restarts T266055

The rolling restart in eqiad happened before the fix to etcd was introduced, which caused some inconsistency in the "safety" of the restart, thus reducing the total runtimes.

But interestingly, while this was in good part a "forced" restart, with the depooling/repooling often happening after the restart was done, it caused just a small uptick in the number of internally generated errors, to 3 errors/s (out of about 30k req/s), due to the backend being unreachable. This is due to the fact that php-fpm restarts, more or less, blazingly fast.

php-restarts-old-pybal.png (320×1 px, 21 KB)

Also: user-facing impact was barely noticeable (the number of errors at the traffic layer is small, because of our retry policies): https://logstash.wikimedia.org/goto/07fb97ce7cee3af5eb2d16851175d727

Performance was largely unaffected on the backend as well - actually all latency percentiles show no significant change (see p99 below as an example)

latency-p99-php-restarts.png (322×1 px, 33 KB)

All in all, I feel confident that restart times will stay at or below 5 minutes (we have some knobs we can tweak), and that even when we have to force a deployment the effect won't be intolerable.

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

This part of Scap is not well known to me, so I'll write here what I understand so that others can fact check it. Scap gets the name of the script it runs to restart php-fpm from Scap's configuration, see file scap/php_fpm.py and the PHPRestart class. That ultimately calls utils.sudo_check_call, which raises subprocess.CalledProcessError if the script fails. The call to sudo_check_call seems to mainly happen via AbstractSync, a base class for all sync related Scap commands, and that uses canaries.

Best I can decipher the logic, the normal canary rules apply: if too many fail, the whole operation fails. All failures are logged in any case.

@thcipriani can you confirm?

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

This part of Scap is not well known to me, so I'll write here what I understand so that others can fact check it. Scap gets the name of the script it runs to restart php-fpm from Scap's configuration, see file scap/php_fpm.py and the PHPRestart class. That ultimately calls utils.sudo_check_call, which raises subprocess.CalledProcessError if the script fails. The call to sudo_check_call seems to mainly happen via AbstractSync, a base class for all sync related Scap commands, and that uses canaries.

Best I can decipher the logic, the normal canary rules apply: if too many fail, the whole operation fails. All failures are logged in any case.

@thcipriani can you confirm?

The PHPRestart class does call sudo_check_call, but it does a try/catch (probably because @Joe wrote that class :)): https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/php_fpm.py#85

After re-reading (but not testing just yet): if there are failures in that restart it will log a warning: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/main.py#522

But otherwise the process continues.
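If I read the thread right, the control flow can be sketched like this (names and wiring are hypothetical; the real code is in scap/php_fpm.py and scap/main.py and uses utils.sudo_check_call rather than subprocess directly):

```python
import logging
import subprocess

logger = logging.getLogger("scap")

def restart_php_fpm(host, script):
    """Run the restart script for one host; return True if it failed.

    Mirrors the described PHPRestart behavior: a non-zero exit status
    is caught and logged as a warning, and the sync continues.
    """
    try:
        # Stand-in for utils.sudo_check_call, which raises
        # subprocess.CalledProcessError on a non-zero exit status.
        subprocess.run([script], check=True)
        return False
    except subprocess.CalledProcessError as e:
        logger.warning("%s had problems restarting php (exit %d)", host, e.returncode)
        return True

# The caller tallies failures but does not abort the deployment:
failed = [h for h in ("mw1001", "mw1002") if restart_php_fpm(h, script="false")]
```

With `script="false"` both hosts report a failure, yet execution falls through to the next step, matching the "log a warning but otherwise continue" behavior described above.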

After doing that, I proceeded to launch the commands on the three pools at the same time, suspecting this could put some more strain on pybal. Sadly it resulted in the whole process taking about 5 minutes. Which I guess is not great but manageable (and btw, once we move to kubernetes, deployment times might very well exceed 5 minutes).

+1 -- 5 minutes for added safety is entirely tolerable.

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

This part of Scap is not well known to me, so I'll write here what I understand so that others can fact check it. Scap gets the name of the script it runs to restart php-fpm from Scap's configuration, see file scap/php_fpm.py and the PHPRestart class. That ultimately calls utils.sudo_check_call, which raises subprocess.CalledProcessError if the script fails. The call to sudo_check_call seems to mainly happen via AbstractSync, a base class for all sync related Scap commands, and that uses canaries.

Best I can decipher the logic, the normal canary rules apply: if too many fail, the whole operation fails. All failures are logged in any case.

@thcipriani can you confirm?

The PHPRestart class does call sudo_check_call, but it does a try/catch (probably because @Joe wrote that class :)): https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/php_fpm.py#85

After re-reading (but not testing just yet): if there are failures in that restart it will log a warning: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/main.py#522

But otherwise the process continues.

I think that is a sensible approach from the point of view of deployment. But it means the script we call from here needs to change a bit. Specifically, it should ensure the server is pooled at the end of the process, no matter what. The logic of the safe restart we have now is that, upon failure, execution stops and nothing else gets done.

That wouldn't work with a deployment, where the logic should be:

  • if server is not pooled, just restart (ok)
  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

We should not care too much about the pooled status after the restart, as we don't really need to wait for it IMHO.

I will try to create a more specific option for the restart script that adopts the above-mentioned logic.
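A minimal sketch of that policy (all names are hypothetical stand-ins; the real logic lives in the conftool/safe-service-restart tooling):

```python
class FakePool:
    """Tiny stand-in for the etcd/PyBal pooling state (assumption)."""
    def __init__(self, pooled, depool_ok=True):
        self.state = dict(pooled)
        self.depool_ok = depool_ok  # simulate etcd/PyBal being (un)reachable

    def is_pooled(self, server):
        return self.state.get(server, False)

    def depool(self, server):
        if self.depool_ok:
            self.state[server] = False
        return self.depool_ok

    def pool(self, server):
        self.state[server] = True

    def force_state(self, server, pooled):
        self.state[server] = pooled  # write the desired state back to etcd

restarted = []

def restart(server):
    restarted.append(server)  # placeholder for the php-fpm restart itself

def deploy_restart(server, pool):
    if not pool.is_pooled(server):
        restart(server)                        # not pooled: just restart
    elif pool.depool(server):
        restart(server)                        # graceful path
        pool.pool(server)                      # repool; no need to block on it
    else:
        restart(server)                        # emergency path: restart anyway...
        pool.force_state(server, pooled=True)  # ...and ensure etcd says "pooled"
```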

One of the things I'm hopeful about after implementing the restart is that it removes the sync-order problem (once we verify the restart works, we can remove the opcache mtime polling and just always restart). This means backports suddenly become much safer and require less experience and training, since the order in which things land on the servers is no longer a problem \o/

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an improved safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing at least a warning in such a way that it propagates to the Scap operator, so that they know to possibly look for impact, especially if a lot of servers unexpectedly got emergency-restarted. I believe we already have a warning mechanism where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying, FYI, that mw1234 had raised error levels.

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an improved safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

This is a good question, I had assumed that depooling was necessary due to php-fpm not handling this gracefully, but I'd be interested in the answer here as well.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing at least a warning in such a way that it propagates to the Scap operator, so that they know to possibly look for impact, especially if a lot of servers unexpectedly got emergency-restarted. I believe we already have a warning mechanism where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying, FYI, that mw1234 had raised error levels.

Issuing a warning to the scap operator should be what happens currently if a machine's run of the check-opcache-limit-and-restart-if-needed script exits non-0. Differentiating a failed restart from a failed-depool+successful restart might be tricky on the scap side(?)

Differentiating a failed restart […]

Maybe it can (or already does) pass-through standard err?

Differentiating a failed restart […]

Maybe it can (or already does) pass-through standard err?

Likely could, currently warns: "X host had problems restarting php": https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/d441da54a48e6fdad03567864b6027f78764a7bf/scap/main.py#517

EDIT:
BUT(!) Now I see that the scap.Job abstraction that the RestartPHP class uses will report stderr to the deployer on non-zero exit code. If the script exits non-zero, the deployer will see the reason why.

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an improved safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

The latter, but to clarify: this fallback should only happen in the eventuality that etcd doesn't work or - worse - that PyBal is lagging so much it can't keep up with pooling/depooling.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing at least a warning in such a way that it propagates to the Scap operator, so that they know to possibly look for impact, especially if a lot of servers unexpectedly got emergency-restarted. I believe we already have a warning mechanism where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying, FYI, that mw1234 had raised error levels.

Unless there is some (unlikely) catastrophic failure somewhere else in the stack, this wouldn't happen. I just want to have a safety measure against events like the above. But anyways, scap should indeed report if any process returned a non-zero exit code, so we will know something went wrong.

But anyways, scap should indeed report if any process returned a non-zero exit code, so we will know something went wrong.

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to an etcd/PyBal failure), it always returns a non-zero exit code, even if it ended up "succeeding" via the emergency restart. It sounds like @Joe is suggesting this?
  • Modify Scap to always display stderr to deployers. Given that Scap already treats "errors" (non-zero exit) as "warnings" (they are not counted toward a threshold, and it continues regardless), we probably don't need a separate signal to distinguish a non-zero exit from success-with-stderr; either way it ends up showing a warning and continuing. So we could widen Scap's "warning" trigger to fire on stderr OR a non-zero exit, instead of only on stderr as now. I think this is what @thcipriani and I were thinking, but only because we assumed the script would have to exit zero if the restart itself succeeded.

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to an etcd/PyBal failure), it always returns a non-zero exit code, even if it ended up "succeeding" with the actual restart. It sounds like @Joe is suggesting this?

This is the best path forward. I have confirmed that if the restart script on a host exits non-zero, scap will print a warning message (<command> on <host> returned [<exit status>]: <stdout and stderr text>) and move on.

@Joe Can we leave it to you to ensure that the restart script exits non-zero when it falls back to an emergency restart?

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to an etcd/PyBal failure), it always returns a non-zero exit code, even if it ended up "succeeding" with the actual restart. It sounds like @Joe is suggesting this?

This is the best path forward. I have confirmed that if the restart script on a host exits non-zero, scap will print a warning message (<command> on <host> returned [<exit status>]: <stdout and stderr text>) and move on.

@Joe Can we leave it to you to ensure that the restart script exits non-zero when it falls back to an emergency restart?

Not that I'm familiar with this code, but it looks like if poolcounter executes an error callback that returns boolean False, the script will exit with code 2: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/conftool/files/safe-service-restart.py#388

So whatever error callback we're specifying, it'll just need to return False.

@Joe, is that roughly what remains before we can try this?
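In other words, a sketch of that reading of safe-service-restart (the callback name and the wrapper are hypothetical, not the script's actual API):

```python
def error_callback():
    # Returning False tells the wrapper the failure should be reported.
    return False

def run_with_poolcounter(action, on_error):
    """Sketch: run an action under poolcounter; if it fails and the
    error callback returns False, return exit status 2, which scap
    would surface as a warning to the deployer."""
    try:
        action()
        return 0
    except Exception:
        # If the callback "handles" the error (returns True), the
        # script still exits 0; otherwise it exits 2.
        return 0 if on_error() else 2
```

A restart that falls back to the emergency path would then be visible on the scap side as exit code 2.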

A data point, in case it's useful. I just promoted train to group1 and Scap told me

15:19:10 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1316.eqiad.wmnet returned [2]: NOT restarting php7.2-fpm: free opcache 312 MB
Fragmentation is at 31%, nothing to do here
Restarting php7.2-fpm: Number of cached keys 27868 is over the 27531 limit
2021-03-03 15:19:10,525 [ERROR] Error running command with poolcounter: timed out

But it didn't error out in the end.

A data point, in case it's useful. I just promoted train to group1 and Scap told me

15:19:10 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1316.eqiad.wmnet returned [2]: NOT restarting php7.2-fpm: free opcache 312 MB
Fragmentation is at 31%, nothing to do here
Restarting php7.2-fpm: Number of cached keys 27868 is over the 27531 limit
2021-03-03 15:19:10,525 [ERROR] Error running command with poolcounter: timed out

But it didn't error out in the end.

This seems like validation that:

BUT(!) Now I see that the scap.Job abstraction that the RestartPHP class uses will report stderr to the deployer on non-zero exit code. If the script exits non-zero, the deployer will see the reason why.

Works :)

It doesn't look like there was an unsafe restart after failing to depool though.

Last restart of php-fpm on that machine was Wed 2021-03-03 16:27:21 (according to service php7.3-fpm status locally). Looking at logstash using "program:php7.2-fpm host:mw1316" it looks like that restart was the only restart that day -- unclear what triggered it (cron?)

@thcipriani @dancy we will try to get what is needed and get back to you, sorry for the delays

Change 682141 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] modules::conftool add safe-service-restart scap option

https://gerrit.wikimedia.org/r/682141

Looks like all that remains is to enable the feature-flag for a restart and see how long a deploy takes. @jijiki do you want to coordinate a specific time so you can watch closely, or should we just do it when we're sure people are around in case of explosion?

Tomorrow at 16:00 UTC (9am US/Pacific) @dancy and I will be testing post-deployment-rolling-php-fpm-restart.

We will test by running scap sync-file README and verifying that the rolling php-fpm restart runs smoothly. We will also measure the time it takes to complete the operation.

Though we expect success, we would appreciate support coverage from SRE while we run the test to help recover in case something goes terribly wrong. Thanks in advance!

We ran the test today. @jijiki supplied SRE backup.

It ran in two phases:

Phase 1: php_fpm_always_restart: false (its usual state at the moment)
scap sync-file README
Total time: 1m19.942s

Phase 2: php_fpm_always_restart: true
scap sync-file README
Total time: 3m12.836s

After the test it was noted in #wikimedia-operations that there was a small spike of HTTP 500 errors during phase 1 and a larger spike during phase 2. A discussion ensued and it was concluded that this is expected, unavoidable, and an acceptable outcome.

Change 719307 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] ::profile::mediawiki::common.pp: Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force

https://gerrit.wikimedia.org/r/719307

Change 719307 merged by Giuseppe Lavagetto:

[operations/puppet@production] Allow mwdeploy to run sudo /usr/local/sbin/restart-php7.2-fpm --force

https://gerrit.wikimedia.org/r/719307

We ran the test today. @jijiki supplied SRE backup.

It ran in two phases:

Phase 1: php_fpm_always_restart: false (its usual state at the moment)
scap sync-file README
Total time: 1m19.942s

Phase 2: php_fpm_always_restart: true
scap sync-file README
Total time: 3m12.836s

After the test it was noted in #wikimedia-operations that there was a small spike of HTTP 500 errors during phase 1 and a larger spike during phase 2. A discussion ensued and it was concluded that this is expected, unavoidable, and an acceptable outcome.

It is a surprise to me that, without a restart, there would be an increase in "normal" HTTP 500 errors during the deploy. I'm not aware of this being the status quo, or of what the causal link would be.

A slight increase in errors with restarts seems expected in principle, given that we can't ensure syncs complete within a given time frame (e.g. 3 minutes) in a way that both restarts all servers to serve the new code and avoids killing requests that take longer than 1-2 minutes to complete. So some amount of request cancellation seems expected, especially for long-running jobs and maintenance scripts.

But the amount of cancellation we saw, and the explanation provided by SRE, suggest that we are cancelling all on-going requests immediately when we start the depool/restart sequence for a particular server. This seems surprising to me, both because it seems easy to avoid and has fairly high end-user impact, and because conceptually the default sequence feels like a graceful restart, in contrast to --force, which is not enabled by default.

My understanding, roughly, is as follows, for the default "graceful" restart:

  • await depooling server.
  • await threshold for on-going requests (e.g. 1 second, which is >95th percentile of appserver latency).
  • restart php-fpm.
  • repool.
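The steps above can be sketched as code (hypothetical helper names, stubbed out here so the sequence is visible):

```python
import time

events = []  # records the order of operations, for illustration

def depool(server):
    events.append(("depool", server))

def restart_php_fpm(server):
    events.append(("restart", server))

def repool(server):
    events.append(("repool", server))

def graceful_restart(server, grace_period=1.0):
    """The proposed default sequence for one server."""
    depool(server)             # 1. take the server out of rotation
    time.sleep(grace_period)   # 2. let in-flight requests drain (>p95 latency)
    restart_php_fpm(server)    # 3. restart php-fpm
    repool(server)             # 4. put it back into rotation
```

The grace period is the knob being debated here: 0 seconds means in-flight requests die with the restart, while ~1 second covers the vast majority of appserver requests.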

Also, given the use of poolcounter for the rolling restart and how the internal Scap code is organised, I'm assuming the code is aware of each server finishing the sequence, not just of when a batch of N servers finishes. As such, if not already the case, it seems like it could be improved by doing the rolling restart in a way that resembles rolling a round wheel rather than kicking a square one. Meaning: don't wait for outliers in PyBal or php-fpm commands before continuing with the next chunk when some servers are already done. That should speed it up overall as well.

Having said that, the sync time is afaik not a blocker, since it is already quite low (3 min), which we previously accepted as fast enough.

I mentioned this to @jijiki a few weeks ago, and she agreed a 1s threshold makes sense and should cover the majority of this.

One other thing that comes to mind is what we do with jobrunners and maintenance hosts.

jobrunners: I'm guessing these are included, since they're straightforward php-fpm sync targets from Scap's perspective (despite being more CLI-like in terms of how we use them). Killing on-going jobs for every minor sync and deployment should be fine in principle, but it's certainly going to stress-test how well everything is put together. It also emphasises the technical debt we have where some jobs are still not retryable; while that's fairly obscure today, with jobs almost never being killed randomly, it'd be a lot more common if we start doing this. I don't have a good sense of this right now in order to assess the risk and likelihood of data loss (e.g. undelivered emails/notifications, outdated categories/search, pages not deleted). We could wing it and see what happens. Or we could take the opportunity to audit job definitions, see which ones don't support retries, inform their owners, and ask them to decide whether and what they want to do about it, and whether they think it is urgent enough to block.

maintenance hosts: I'm guessing these are not affected by current restart logic, unless someone wrote code for killing on-going PHP CLI processes that are not managed by php-fpm. I'm guessing this is fine and we don't need to kill those?

Also, given the use of poolcounter for the rolling restart and how the internal Scap code is organised, I'm assuming the code is aware of each server finishing the sequence, not just of when a batch of N servers finishes. As such, if not already the case, it seems like it could be improved by doing the rolling restart in a way that resembles rolling a round wheel rather than kicking a square one. Meaning: don't wait for outliers in PyBal or php-fpm commands before continuing with the next chunk when some servers are already done. That should speed it up overall as well.

I can confirm that scap just limits the maximum number of concurrent tasks. It does not perform tasks in discrete batches.
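That concurrency-limited (rather than batched) behavior can be sketched with a semaphore (a simplified model, not Scap's actual scheduler): a slow host only occupies one slot, and the other hosts keep rolling.

```python
import threading

def rolling_restart(hosts, restart_one, max_concurrency=2):
    """Sketch: cap concurrent restarts instead of using fixed batches,
    so one slow host does not hold up the rest of the fleet."""
    sem = threading.Semaphore(max_concurrency)

    def worker(host):
        with sem:               # at most max_concurrency restarts at once
            restart_one(host)

    threads = [threading.Thread(target=worker, args=(h,)) for h in hosts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

As each worker releases its slot, the next pending host starts immediately, which is the "round wheel" behavior discussed above.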

The test command for the record: scap sync-file -D php_fpm_always_restart:true README

We ran the test today. @jijiki supplied SRE backup.

It ran in two phases:

Phase 1: php_fpm_always_restart: false (its usual state at the moment)
scap sync-file README
Total time: 1m19.942s

Phase 2: php_fpm_always_restart: true
scap sync-file README
Total time: 3m12.836s

After the test it was noted in #wikimedia-operations that there was a small spike of HTTP 500 errors during phase 1 and a larger spike during phase 2. A discussion ensued and it was concluded that this is expected, unavoidable, and an acceptable outcome.

It is a surprise to me that, without a restart, there would be an increase in "normal" HTTP 500 errors during the deploy. I'm not aware of this being the status quo, or of what the causal link would be.

Depending on the nature of the deployment, yes it's normal to get a few 5xx errors - usually reported by the appservers, but they won't reach the user, unless it's an edit.

A slight increase in errors with restarts seems expected in principle, given that we can't ensure syncs complete within a given time frame (e.g. 3 minutes) in a way that both restarts all servers to serve the new code and avoids killing requests that take longer than 1-2 minutes to complete. So some amount of request cancellation seems expected, especially for long-running jobs and maintenance scripts.

For maintenance scripts, no. But this could indeed prove problematic for videoscaling of large files. It's a problem we have to solve for k8s anyways.

But the amount of cancellation we saw, and the explanation provided by SRE, suggests that we are cancelling all on-going requests immediately when we start the depool/restart sequence for a particular server.

I just found the problem, which frankly is baffling to me - apparently the script we're using to restart from scap doesn't declare --grace-period as an argument of safe-service-restart, which is what determines the number of seconds we wait before restarting php-fpm. I am just going to change the default for safe-service-restart for now, because leaving it at zero seconds almost never seems like a good idea to me.

Change 752600 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] safe-service-restart: make the default grace period 3 seconds

https://gerrit.wikimedia.org/r/752600

Change 752600 merged by Giuseppe Lavagetto:

[operations/puppet@production] safe-service-restart: make the default grace period 3 seconds

https://gerrit.wikimedia.org/r/752600

Retest after the 3 second grace period commit was merged:

dancy@deploy1002:/srv/mediawiki-staging$ scap sync-file -D php_fpm_always_restart:true README "Testing php-fpm restart"
....
16:13:10 Started sync-apaches
sync-apaches: 100% (ok: 347; fail: 0; left: 0)
16:13:19 Finished sync-apaches (duration: 00m 08s)
16:13:19 Started php-fpm-restarts
16:13:19 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 9223372036854775807' on 352 host(s)
16:15:45 Finished php-fpm-restarts (duration: 02m 25s)
16:15:45 Synchronized README: Testing php-fpm restart (duration: 03m 18s)

@Joe monitored https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=63&orgId=1&from=1642175067857&to=1642179176551 during the operation. The peak 503 error rate during the restart phase was 0.433 503s per second.

dancy updated the task description.

Hi @Joe . Can you update the description of this report with the remaining areas of work?

Change 792980 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: do not restart jobrunners on deployment

https://gerrit.wikimedia.org/r/792980

Change 792981 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: enable restarting php-fpm on deployment

https://gerrit.wikimedia.org/r/792981

Change 792982 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki::php: check opcache revalidation in restart script

https://gerrit.wikimedia.org/r/792982

Change 792983 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki_canaries: disable opcache revalidation

https://gerrit.wikimedia.org/r/792983

Change 792984 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki: disable revalidation everywhere

https://gerrit.wikimedia.org/r/792984

Change 792980 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: do not restart jobrunners on deployment

https://gerrit.wikimedia.org/r/792980

Change 792981 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: enable restarting php-fpm on deployment

https://gerrit.wikimedia.org/r/792981

Mentioned in SAL (#wikimedia-operations) [2022-05-30T08:42:41Z] <oblivian@deploy1002> Synchronized README: testing php restarts with scap, T266055 (duration: 00m 45s)

Mentioned in SAL (#wikimedia-operations) [2022-05-30T09:53:34Z] <oblivian@deploy1002> Synchronized README: testing php restarts with scap, T266055 (duration: 03m 30s)

Change 792983 merged by Giuseppe Lavagetto:

[operations/puppet@production] mediawiki_canaries: disable opcache revalidation

https://gerrit.wikimedia.org/r/792983

Current status is:

  • Each deployment will restart php-fpm
  • api and appserver canaries have opcache revalidation turned off
  • Tomorrow we'll turn it off for all appservers as well.

Among other things, this means that until we do the restart, the new code is mostly not visible, making deployments slightly more atomic.
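For reference, the opcache directive being toggled here looks roughly like this in php.ini / the fpm pool configuration (a sketch; the directive names are real PHP ini settings, but the exact file and values used in production puppet may differ):

```ini
; With validate_timestamps=0, opcache never re-stats files on disk,
; so newly synced code only becomes visible after php-fpm restarts.
opcache.validate_timestamps=0

; When revalidation is enabled instead, this controls how often
; (in seconds) opcache re-checks file mtimes:
; opcache.revalidate_freq=60
```

This is why the rolling restart has to be unconditional once revalidation is disabled: without it, deployed code would never take effect.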

Change 801773 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/train-dev@master] Set php_fpm_always_restart to True in deploy:/etc/scap.cfg

https://gerrit.wikimedia.org/r/801773

Change 801773 merged by jenkins-bot:

[mediawiki/tools/train-dev@master] Set php_fpm_always_restart to True in deploy:/etc/scap.cfg

https://gerrit.wikimedia.org/r/801773

Change 801767 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/tools/scap@master] Restart canaries before checking, remove the opcache invalidation function

https://gerrit.wikimedia.org/r/801767

Change 801767 merged by jenkins-bot:

[mediawiki/tools/scap@master] Restart canaries before checking, remove the opcache invalidation function

https://gerrit.wikimedia.org/r/801767

Change 801807 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[mediawiki/tools/scap@master] Don't attempt php fpm restart if not configured

https://gerrit.wikimedia.org/r/801807

Change 801807 merged by jenkins-bot:

[mediawiki/tools/scap@master] Don't attempt php fpm restart if not configured

https://gerrit.wikimedia.org/r/801807

This did not exactly happen and was reverted because of a missing patch in Scap. Now that Scap 4.8.2 is rolled out (T309116), I will turn off opcache revalidation on the api/appserver canaries, and potentially do all appservers tomorrow if everything looks fine.

Change 802134 had a related patch set uploaded (by JMeybohm; author: Giuseppe Lavagetto):

[operations/puppet@production] mediawiki: stop revalidating opcache on canaries

https://gerrit.wikimedia.org/r/802134

Change 802134 merged by JMeybohm:

[operations/puppet@production] mediawiki: stop revalidating opcache on canaries

https://gerrit.wikimedia.org/r/802134

Mentioned in SAL (#wikimedia-operations) [2022-06-07T11:04:47Z] <jayme> restarting php-fpm on api and appserver canaries - T266055
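The depool/restart/repool cycle referenced throughout this task can be sketched as below. This is an illustrative outline only, not the actual Scap implementation: `depool` and `pool` stand in for Wikimedia's conftool-based commands and are stubbed with `echo` here so the sketch is self-contained.

```shell
#!/bin/sh
# Hypothetical sketch of a safe rolling php-fpm restart, one server
# at a time, so live capacity is preserved. The real implementation
# lives in Scap; these stubs only echo what each step would do.
depool()      { echo "depool $1"; }
pool()        { echo "pool $1"; }
restart_fpm() { echo "systemctl restart php7.2-fpm on $1"; }

for host in mw1001 mw1002; do
  depool "$host"       # take the server out of rotation
  restart_fpm "$host"  # restart php-fpm, clearing opcache
  pool "$host"         # put it back once healthy
done
```

The emergency variant tracked in T243009 would skip the depool/pool steps and restart everywhere at once, trading capacity guarantees for speed.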

Change 803955 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

Change 803955 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/803955

Change 804441 had a related patch set uploaded (by Ahmon Dancy; author: Ahmon Dancy):

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

Change 804441 merged by Dzahn:

[operations/puppet@production] scap.cfg.erb: Define php_fpm restart settings for beta cluster

https://gerrit.wikimedia.org/r/804441

Hey @dancy, do you think we're good to proceed with disabling revalidation for all of api,app,parsoid ?

Yes. Having opcache.validate_timestamps=0 on the canaries doesn't seem to have resulted in any problems.

Added to task description, and ticked off the first:

  • Set opcache.validate_timestamps=0 on prod canaries.
  • Set opcache.validate_timestamps=0 in Beta Cluster app servers that do php-fpm restarts.
  • Set opcache.validate_timestamps=0 on all appserver clusters that do php-fpm restarts.

I've moved the item about jobrunner to "Out of scope" to reflect the above decision. Changing jobrunner does not block the parent task.

Change 792984 merged by JMeybohm:

[operations/puppet@production] mediawiki: disable revalidation for api,app,parsoid clusters

https://gerrit.wikimedia.org/r/792984

Mentioned in SAL (#wikimedia-operations) [2022-06-23T10:25:18Z] <jayme> running restart-php7.2-fpm A:parsoid or A:mw or A:mw-api to disable opcache revalidation - T266055

The ones left in prod are: mwdebug[2001-2002].codfw.wmnet, mwdebug[1001-1002].eqiad.wmnet