
Update Scap to perform a rolling restart for all MW deploys
Open, Medium, Public

Description

The outcome of T253673 is to go with idea 3 - rolling restarts.

Work:

Event Timeline

  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long tail of issues that we've been playing whack-a-mole with -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then waiting 15 minutes for it to be fixed is problematic (cf: T244544)

@Joe said today that a rolling restart takes 5-10 minutes, not 15 minutes. […]

RE: Having an option to take a shortcut, […] not restarting equates to effectively not having deployed any code.

However, if we want to have a shortcut that doesn't do things in a rolling way, but immediately […] e.g. if most requests are HTTP 5xx anyway, and if not having appserver capacity is handled gracefully higher in the traffic stack, then maybe an option for a bigger, more risky batch could make sense. […]

I checked with SRE, and it looks like we should not re-purpose restart-php7.2-fpm and safe-service-restart.py here. […]

Open questions (I guess for @Joe):

  • Would "simply" restarting php-fpm directly abort on-going requests? I assume so, and we're fine with this. Just checking so that we can document this as expected.

Yes, restarting php-fpm will kill all in-flight requests.

  • What would happen to requests routed to an appserver during the restart? Fail fast? If not, is there something at hand that you'd like to configure differently to make such reqs fail fast?

Yes, it would fail fast, and unless it's a POST, it will be retried by ATS.

  • Is it worth putting some kind of very crude batching or rolling into this forced/emergency restart mode? Or is fanning out everywhere as fast as we can fine?

I think the same batching we maintain for the non-force case should be fine.

The only difference is that in that case we'd just have to run a different command, which might need some permissions work.

thcipriani added subscribers: dancy, LarsWirzenius.

Assigning per brief discussion in IRC. @dancy feel free to loop in others (@LarsWirzenius will be helpful for actual release) as needed.

thcipriani triaged this task as Medium priority.Oct 20 2020, 6:07 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.

I added what I think is an outline of the work we still need to do in order to make this process safe and efficient in T243009#6574045

Change 635991 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] safe-service-restart: add optional poolcounter support

https://gerrit.wikimedia.org/r/635991

Change 635992 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] poolcounter: add client configuration classes

https://gerrit.wikimedia.org/r/635992

Change 635993 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::lvs::realserver: add ability to configure poolcounter for pools

https://gerrit.wikimedia.org/r/635993

Change 635994 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] restbase: add poolcounter support to safe-service-restart scripts

https://gerrit.wikimedia.org/r/635994

Change 636074 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/puppet@production] modules/scap/templates/scap.cfg.erb: Update php_fpm_restart_script

https://gerrit.wikimedia.org/r/636074

Change 635991 merged by Giuseppe Lavagetto:
[operations/puppet@production] safe-service-restart: add optional poolcounter support

https://gerrit.wikimedia.org/r/635991

Change 635992 merged by Giuseppe Lavagetto:
[operations/puppet@production] poolcounter: add client configuration classes

https://gerrit.wikimedia.org/r/635992

Change 635993 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::lvs::realserver: add ability to configure poolcounter for pools

https://gerrit.wikimedia.org/r/635993

Change 635994 merged by Giuseppe Lavagetto:
[operations/puppet@production] restbase: add poolcounter support to safe-service-restart scripts

https://gerrit.wikimedia.org/r/635994

Change 640928 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::lvs::realserver: use poolcounter for guarding service restarts

https://gerrit.wikimedia.org/r/640928

Change 640928 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::lvs::realserver: use poolcounter for guarding service restarts

https://gerrit.wikimedia.org/r/640928

https://gerrit.wikimedia.org/r/c/operations/puppet/+/636074 needs review and merge before this can be wrapped up.

Confirmed with @effie that this will roll out early this week, thus resolving sub task T243009.

I guess next is to prepare for actually using the new unconditional php-fpm rolling restart during regular Scap syncs.
What is left to do here? And do we want to coordinate with SRE the first time such a sync is done in production?

Krinkle updated the task description.

The feature flag has landed meanwhile. Highlights from T264362:

In T264362#6511242, @Joe wrote on 2 Oct 2020:

For the record, I just performed a rolling restart, using the same command as scap does, in eqiad for the appserver cluster - the largest we have. This took 2 minutes 14 seconds.

With a couple of changes on the load balancers, that can be sped up significantly - currently pybal waits a grace time of 1 second […] which means that right now the lower limit for the rolling restart would be between 5 and 10 minutes […]. We can halve that time by just reducing that sleep period on the load balancers.

In T264362#6602993, @gerritbot wrote on 5 Nov 2020:

Change 631686 merged by jenkins-bot:
[operations/debs/pybal@1.15] Reduce reconnectTimeout for etcd to 0.1 seconds

https://gerrit.wikimedia.org/r/631686

In T264362#6586586, @gerritbot wrote on 28 Oct 2020:

Change 635897 merged by jenkins-bot:
[mediawiki/tools/scap@master] scap: Add support for ungraceful php-fpm restart

https://gerrit.wikimedia.org/r/635897

So a full restart took ~2 minutes (and that was before the reconnectTimeout reduction), which is pretty good - not the 15 minutes we feared in T253673.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/636074 needs review and merge before this can be wrapped up.

Confirmed with @effie that this will roll out early this week, thus resolving sub task T243009.

I guess next is to prepare for actually using the new unconditional php-fpm rolling restart during regular Scap syncs.
What is left to do here? And do we want to coordinate with SRE the first time such a sync is done in production?

Yes please. I'd like someone from serviceops and someone from Release-Engineering-Team to be on hand for the first one of these. To either collaboratively celebrate or collaboratively troubleshoot.

Update from @Joe: The above pybal change has been merged but not yet released. It will need to be released first to ensure that rolling restarts not only can take 2 minutes, but will in fact consistently take < 2 minutes, by not letting the current, more liberal timeouts stack up.

Mentioned in SAL (#wikimedia-operations) [2021-01-28T10:33:53Z] <_joe_> performing a test-run of the rolling restart of php-fpm in codfw, using the same code scap will use T266055. Starting from the api cluster, then proceeding with others

I have updated pybal in codfw this morning to do a first test run of how long the rolling restart will take.

First results are, for a *sequential* run:

API: ~2 minutes 20 s
jobrunner: ~3 minutes
appserver: ~2 minutes 20 s

The results are only partially puzzling (the jobrunners are just 18 servers, but we only restart servers in batches of 2); these times were pretty much in line with what I expected.

After doing that, I proceeded to launch the commands on the three pools at the same time, suspecting this could put more strain on pybal. Sadly, it resulted in the whole process taking about 5 minutes, which I guess is not great but manageable (and btw, once we move to kubernetes, deployment times might very well exceed 5 minutes).

I'll do one more test, this time in eqiad, where I suspect the times might be slightly longer, but not much, mostly to check if we emit a lot of 5xx errors because of the rolling restart, and to check if there is any impact on performance.

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

Mentioned in SAL (#wikimedia-operations) [2021-01-28T11:01:10Z] <_joe_> restarting php-fpm on the appserver,api and jobrunner clusters in eqiad, 10% at a time, for simulating scap rolling restarts T266055

The rolling restart in eqiad happened before the fix to etcd was introduced, which caused some inconsistency in the "safety" of the restart, thus reducing the total runtimes.

But interestingly, while this was in good part a "forced" restart, with the depooling/repooling often happening after the restart was done, it caused just a small uptick in the number of errors generated internally, to about 3 errors/s due to unreachability of the backend (out of roughly 30k req/s). This is because php-fpm restarts blazingly fast, more or less.
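(For scale, a quick back-of-envelope: 3 errors/s against ~30,000 req/s is on the order of 0.01% of internal requests.)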

Also: user-facing impact was barely noticeable (the number of errors at the traffic layer is small, because of our retry policies): https://logstash.wikimedia.org/goto/07fb97ce7cee3af5eb2d16851175d727

Performance was largely unaffected on the backend as well - actually all latency percentiles show no significant change (see p99 below as an example)

All in all, I feel confident that restart times will stay at or below 5 minutes (we have some knobs we can tweak), and that even when we have to force a deployment the effect won't be intolerable.

I have one further question for scappists: what will happen if the restart script returns a non-zero return code? Does scap stop restarting whenever an error is found, or does it have a tolerance threshold?

This part of Scap is not well known to me, so I'll write here what I understand so that others can fact-check it. Scap gets the name of the script it runs to restart php-fpm from Scap's configuration; see the file scap/php_fpm.py and the PHPRestart class. That ultimately calls utils.sudo_check_call, which raises subprocess.CalledProcessError if the script fails. The call to sudo_check_call seems to mainly happen via AbstractSync, a base class for all sync-related Scap commands, and that uses canaries.

As best I can decipher the logic, the normal canary rules apply: if too many fail, the whole operation fails. All failures are logged in any case.

@thcipriani can you confirm?

The PHPRestart class does call sudo_check_call, but it does so in a try/except (probably because @Joe wrote that class :)): https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/php_fpm.py#85

After re-reading (but not testing just yet): if there are failures in that restart it will log a warning: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/main.py#522

But otherwise the process continues.
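To make that warn-and-continue shape concrete, here is a minimal sketch (not the actual scap source; restart_php_on()/restart_php() and the ssh invocation are made-up stand-ins) of the behaviour described above: a failing restart script is caught, logged as a warning, and the sync carries on.

```python
# Hedged sketch of the warn-and-continue pattern described above.
# Not scap's real code; names and the ssh call are illustrative only.
import logging
import subprocess

logger = logging.getLogger("scap")

def restart_php_on(host, restart_cmd):
    """Run the restart script on one host; warn on failure, never raise."""
    try:
        # scap's real code path goes through utils.sudo_check_call();
        # subprocess.check_call stands in for it in this sketch.
        subprocess.check_call(["ssh", host, restart_cmd])
        return True
    except subprocess.CalledProcessError as exc:
        logger.warning("%s on %s returned [%d]", restart_cmd, host, exc.returncode)
        return False

def restart_php(hosts, restart_cmd):
    failed = [h for h in hosts if not restart_php_on(h, restart_cmd)]
    if failed:
        # Mirrors the "had problems restarting php" warning; the sync keeps going.
        logger.warning("%d hosts had problems restarting php", len(failed))
```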

After doing that, I proceeded to launch the commands on the three pools at the same time, suspecting this could put more strain on pybal. Sadly, it resulted in the whole process taking about 5 minutes, which I guess is not great but manageable (and btw, once we move to kubernetes, deployment times might very well exceed 5 minutes).

+1 -- 5 minutes for added safety is entirely tolerable.

The PHPRestart class does call sudo_check_call, but it does so in a try/except (probably because @Joe wrote that class :)): https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/php_fpm.py#85

After re-reading (but not testing just yet): if there are failures in that restart it will log a warning: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/main.py#522

But otherwise the process continues.

I think that is a sensible approach from the point of view of deployment. But it means the script we call from here needs to change a bit. Specifically, it should just make sure the server is pooled at the end of the process, no matter what. The logic of the safe restart we have now is that, upon failure, execution stops and nothing else gets done.

That wouldn't work with a deployment, where the logic should be:

  • if server is not pooled, just restart (ok)
  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

We should not care too much about the pooled status after the restart, as we don't really need to wait for it IMHO.

I will try to create a more specific option for the restart script that will adopt the abovementioned logic.
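A minimal sketch of that logic, assuming hypothetical helpers is_pooled(), set_pooled() and restart_php_fpm() (this is not safe-service-restart.py, just the shape of the option described above):

```python
# Minimal sketch of the deploy-restart logic described above, under the
# assumption of hypothetical helper callables; not safe-service-restart.py.
import logging

logger = logging.getLogger("deploy-restart")

def deploy_restart(host, is_pooled, set_pooled, restart_php_fpm):
    """Restart php-fpm, preferring a clean depool but never blocking the deploy."""
    degraded = False
    try:
        if is_pooled(host):
            try:
                set_pooled(host, False)          # best-effort depool via etcd
            except Exception:
                # etcd/pybal trouble: restart anyway, while still pooled.
                logger.warning("%s: depool failed, restarting while pooled", host)
                degraded = True
        restart_php_fpm(host)
    finally:
        # Always record the desired state in etcd as "pooled"; we don't
        # wait for the repool to actually propagate.
        try:
            set_pooled(host, True)
        except Exception:
            logger.error("%s: could not record pooled=yes in etcd", host)
            degraded = True
    # Exit-code-style signal back to scap: non-zero makes it print a warning.
    return 2 if degraded else 0
```

The key property is the finally block: whatever happens during the restart, the desired etcd state ends up as "pooled", and a degraded run signals itself with a non-zero-style return value.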

One of the things I'm hopeful about after implementing the restart is that by restarting we remove the sync-order problem (after we verify the restart works, we can remove the opcache mtime polling and just always restart). This means backports suddenly become much safer and require less experience and training, since the order in which things land on the servers is no longer a problem \o/

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an extra safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing a warning in a way that propagates to the Scap operator, so that they know to look for possible impact, especially if a lot of servers unexpectedly got an emergency restart. I believe we already have a warning mechanism for this, where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying that mw1234 had raised error levels, FYI.

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an extra safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

This is a good question. I had assumed that depooling was necessary because php-fpm doesn't handle this gracefully, but I'd be interested in the answer here as well.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing a warning in a way that propagates to the Scap operator, so that they know to look for possible impact, especially if a lot of servers unexpectedly got an emergency restart. I believe we already have a warning mechanism for this, where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying that mw1234 had raised error levels, FYI.

Issuing a warning to the scap operator should be what currently happens if a machine's run of the check-opcache-limit-and-restart-if-needed script exits non-zero. Differentiating a failed restart from a failed-depool + successful-restart might be tricky on the scap side(?)

Differentiating a failed restart […]

Maybe it can (or already does) pass through standard error?

It likely could; currently it just warns "X host had problems restarting php": https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/d441da54a48e6fdad03567864b6027f78764a7bf/scap/main.py#517

EDIT:
BUT(!) Now I see that the scap.Job abstraction that the RestartPHP class uses will report stderr to the deployer on non-zero exit code. If the script exits non-zero, the deployer will see the reason why.

  • if the server is pooled and you cannot depool it, just restart and ensure the status in etcd is the one we want

If I understand correctly, this turns a regular deploy into effectively an emergency deploy if the regular one doesn't work for some reason: the server keeps accepting new requests and then abruptly aborts both those and the ones already in flight. Or does php-fpm have its own graceful restart as well? I don't recall whether pool/depool is an optimisation, an extra safety measure, or actually the only thing that distinguishes the default graceful restart from the abrupt/emergency one.

The latter, but to clarify: this fallback should only happen in the eventuality that etcd doesn't work or - worse - that pybal is lagging so much it can't keep up with pooling/depooling.

In any event, I guess continuing is fine for now, especially given retries, but it might be worth printing a warning in a way that propagates to the Scap operator, so that they know to look for possible impact, especially if a lot of servers unexpectedly got an emergency restart. I believe we already have a warning mechanism for this, where e.g. failing canary checks are not fatal to the scap deploy but do print a line saying that mw1234 had raised error levels, FYI.

Unless there is some (unlikely) catastrophic failure somewhere else in the stack, this wouldn't happen. I just want to have a safety measure against events like the above. But anyways, scap should indeed report if any process returned a non-zero exit code, so we will know something went wrong.

But anyways, scap should indeed report if any process returned a non-zero exit code, so we will know something went wrong.

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to an etcd/pybal failure), it always returns a non-zero exit code, even if it ended up "succeeding" via an emergency restart. It sounds like @Joe is suggesting this? (A sketch of this option follows after the list.)
  • Modify Scap to always display the stderr to deployers. Given that it already treats "errors" (non-zero exit) as "warnings" (they are not counted toward a threshold, and it continues regardless), we probably don't need a separate signal in Scap to distinguish a non-zero exit from a success that wrote to stderr; either way it ends up showing a warning and continuing. So I suppose we could widen Scap's "warning" trigger to fire on stderr OR a non-zero exit, instead of only on stderr as it does now. I think this is what @thcipriani and I were thinking, but only because we assumed the script would have to exit zero (success) if the restart itself succeeded.
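For illustration, the first option could look like this at the tail of the restart script (a hypothetical sketch; run_restart() is a made-up stand-in for the actual depool/restart/repool work, not the real script):

```python
# Sketch of option 1 (hypothetical, not the real restart script): exit
# non-zero whenever we had to fall back to an emergency restart, even if
# php-fpm itself came back up fine, so that scap prints its warning line.
import sys

def run_restart() -> bool:
    """Stub standing in for the depool/restart/repool work; returns True
    when the script had to degrade to an emergency (still-pooled) restart."""
    return False

def main() -> int:
    degraded = run_restart()
    if degraded:
        print("restarted while pooled (etcd/pybal unavailable)", file=sys.stderr)
        return 2   # non-zero exit => scap warns the deployer and moves on
    return 0

if __name__ == "__main__":
    sys.exit(main())
```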

Okay, so we can either:

  • Modify the restart script so that when it degrades to an emergency restart (due to an etcd/pybal failure), it always returns a non-zero exit code, even if it ended up "succeeding" with the actual restart. It sounds like @Joe is suggesting this?

This is the best path forward. I have confirmed that if the restart script on a host exits non-zero, scap will print a warning message (<command> on <host> returned [<exit status>]: <stdout and stderr text>) and move on.

@Joe Can we leave it to you to ensure that the restart script exits non-zero when it falls back to an emergency restart?

Not that I'm familiar with this code, but it looks like if poolcounter executes the error callback and that callback returns boolean False, the script will exit with code 2: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/conftool/files/safe-service-restart.py#388

So whatever error callback we're specifying, it just needs to return False.

@Joe is that roughly all that remains before we can try this?
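If I read that right, the callback shape would be something like the following (a hedged sketch; on_poolcounter_error is a made-up name, not the real safe-service-restart.py API):

```python
# Hypothetical sketch only: the error-callback shape being discussed above.
import sys

def on_poolcounter_error(exc):
    """Hypothetical error callback, invoked when poolcounter fails (e.g. a timeout)."""
    print("Error running command with poolcounter: %s" % exc, file=sys.stderr)
    # Returning False is what, per the script linked above, gets translated
    # into an exit code of 2 instead of proceeding as if the lock were held.
    return False
```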

A data point, in case it's useful. I just promoted the train to group1 and Scap told me:

15:19:10 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mw1316.eqiad.wmnet returned [2]: NOT restarting php7.2-fpm: free opcache 312 MB
Fragmentation is at 31%, nothing to do here
Restarting php7.2-fpm: Number of cached keys 27868 is over the 27531 limit
2021-03-03 15:19:10,525 [ERROR] Error running command with poolcounter: timed out

But it didn't error out in the end.

This seems like validation that:

BUT(!) Now I see that the scap.Job abstraction that the RestartPHP class uses will report stderr to the deployer on non-zero exit code. If the script exits non-zero, the deployer will see the reason why.

Works :)

It doesn't look like there was an unsafe restart after failing to depool though.

Last restart of php-fpm on that machine was Wed 2021-03-03 16:27:21 (according to service php7.3-fpm status locally). Looking at logstash using "program:php7.2-fpm host:mw1316" it looks like that restart was the only restart that day -- unclear what triggered it (cron?)

@thcipriani @dancy we will try to get what is needed and get back to you, sorry for the delays

Change 682141 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] modules::conftool add safe-service-restart scap option

https://gerrit.wikimedia.org/r/682141

Looks like all that remains is to enable the feature-flag for a restart and see how long a deploy takes. @jijiki do you want to coordinate a specific time so you can watch closely, or should we just do it when we're sure people are around in case of explosion?