Enhance MediaWiki deployments for support of php7.x
Closed, ResolvedPublic
Actions

Description

Our current deployment strategy for MediaWiki's code is as follows:

rsync files to servers
wait for the php interpreter to pick up the changes on the filesystem

This didn't work very well with HHVM, but it was working passably bad, so we never invested the time and resources to improve what we were doing. With php7, things are even worse:

PHP7 saves its opcode in the opcache, which stores opcode by filesystem path, not by inode like APCu did
opcache doesn't automatically detect changes on the filesystem, if not with a huge performance penalty
opcache can be reset upon release, but it causes all sorts of instabilities and dangerous inconsistencies (see T224491)
opcache can revalidate the content of files at regular intervals (what we're doing now)

Problems with our current approach are:

No atomicity on release, inconsistencies between different scripts can go as far as the whole revalidation frequency, as that's per-file and not global
Opcache can become full and might cause a huge performance hit.

Given the impossibility of reliably resetting the opcache, we're left with only one option: perform a rolling restart of the php-fpm daemons upon every release (except sync-file).

I see a few ways to go with this in the short term:

We implement the php/hhvm rolling restart cookbook for spicerack (dependent on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/503947 being merged), and we set up scap to run it after all the code has been distributed
We implement a smart script on the appservers themselves that can be called by scap that will depool, restart and repool the appserver. We can make appservers wait in queue for their turn by using something like poolcounter.

In any case, it could be advisable to change the way we do SWATs: we might just merge change after change, test them on mwdebug, then deploy cluster-wide one single time.

It's clear this is not a long term solution in my mind. For that we'll have to radically rethink how we do code deploys.

Details

Subject	Repo	Branch	Lines +/-
scap: restart php-fpm if needed when doing a full deploy	operations/puppet	production	+13 -0
Fix restart_php	mediawiki/tools/scap	master	+13 -15
php-fpm: sudo_check_call takes a string	mediawiki/tools/scap	master	+4 -1
scap: set flag for check-and-restart-php	operations/puppet	production	+2 -0
mediawiki::common: Allow mwdeploy user to restart php-fpm	operations/puppet	production	+14 -3
php7x: restart php-fpm after all sync operations	mediawiki/tools/scap	master	+39 -2
php7x: restart on 'scap pull'	mediawiki/tools/scap	master	+42 -11
php7x: Add php_fpm module	mediawiki/tools/scap	master	+146 -0
mediawiki: run the cron for php restarts everywhere	operations/puppet	production	+22 -4
mediawiki::php: add daemon restart cronjob	operations/puppet	production	+81 -2
safe-service-restart: logging improvements	operations/puppet	production	+19 -3
mediawiki::php: increase opcache on jobrunners	operations/puppet	production	+5 -1
mediawiki::php::monitoring: add checks for opcache status	operations/puppet	production	+80 -1
mediawiki::php: increase opcache everywhere	operations/puppet	production	+2 -2
mediawiki: increase opcache space on canaries	operations/puppet	production	+2 -2
conftool::scripts: add a safe-service-restart script	operations/puppet	production	+234 -0

Related Objects
Search...

Status	Assigned	Task
Resolved	None	T176370 Migrate to PHP 7 in WMF production
Resolved	thcipriani	T224857 Enhance MediaWiki deployments for support of php7.x
Resolved	Dzahn	T230144 Deploy scap 3.12.0-1 to production

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 514660 merged by Giuseppe Lavagetto:
[operations/puppet@production] conftool::scripts: add a safe-service-restart script

https://gerrit.wikimedia.org/r/514660

Maintenance_bot removed a project: Patch-For-Review.Jun 6 2019, 10:10 AM

Change 514695 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki: increase opcache space on canaries

https://gerrit.wikimedia.org/r/514695

Change 514695 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: increase opcache space on canaries

https://gerrit.wikimedia.org/r/514695

Maintenance_bot removed a project: Patch-For-Review.Jun 6 2019, 12:10 PM

Change 514706 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: increase opcache everywhere

https://gerrit.wikimedia.org/r/514706

gerritbot added a project: Patch-For-Review.Jun 6 2019, 12:40 PM

Lucas_Werkmeister_WMDE mentioned this in T225207: Enable scap to roll back broken changes to MediaWiki.Jun 6 2019, 1:26 PM

Change 514706 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: increase opcache everywhere

https://gerrit.wikimedia.org/r/514706

Maintenance_bot removed a project: Patch-For-Review.Jun 6 2019, 3:10 PM

mark subscribed.Jun 6 2019, 3:58 PM

Change 514799 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php::monitoring: add checks for opcache status

https://gerrit.wikimedia.org/r/514799

gerritbot added a project: Patch-For-Review.Jun 6 2019, 4:22 PM

Change 514799 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php::monitoring: add checks for opcache status

https://gerrit.wikimedia.org/r/514799

Maintenance_bot removed a project: Patch-For-Review.Jun 6 2019, 5:10 PM

A small comment about what has been done and what needs to be done:

We upped opcache to 1 GB. By my evaluations, that should last about 1 week of deployments before being exhausted. Maybe less
We added a check for the opcache health
we created a script for safely restarting php-fpm that could be also used during deployments

and finally

we're almost ready to deploy a periodic job that willl restart automatically php-fpm when at low cache space

This will allow us to survive a little longer, even to unblock the migration maybe, but this ticket /needs/ a more permanent solution, which means we need either to rethink how we distribute code in production, and specifically I think the low-hanging fruit is to do rolling restarts upon each deployment. We can solve the pybal slowness on one side, and create a cumin script scap can launch after every sync (and be done in a couple minutes).

Regarding rollbacks: I guess that we can add a switch that, just for rollbacks, sends out the opcache reset first, and does the rolling restart afterwards, thus reducing the window of time in which the issues are present. We'll have a small window of time in which an opcache corruption would be possible, but only in case of emergency.

@thcipriani would that work for addressing your concerns?

Reedy moved this task from Untriaged to Wikimedia production on the PHP 7.2 support board.Jun 8 2019, 11:50 PM

In T224857#5243133, @Joe wrote:

Regarding rollbacks: I guess that we can add a switch that, just for rollbacks, sends out the opcache reset first, and does the rolling restart afterwards, thus reducing the window of time in which the issues are present. We'll have a small window of time in which an opcache corruption would be possible, but only in case of emergency.

@thcipriani would that work for addressing your concerns?

Checking my understanding of what you're saying (please correct me where I'm misunderstanding): for rollback/emergencies, scap does opcache_reset, syncs the file, then calls the smart script to do a rolling restart (11 minutes -- 350 servers 2 seconds/server). By sending the opcache_reset first means that the change goes live immediately, and the restart is just for cache sanity.

If the above is correct: I thought opcache_reset was never safe?

Other small problem with scap is that for MediaWiki there is no specific rollback command (and, unfortunately doesn't rollback automagically T225207: Enable scap to roll back broken changes to MediaWiki); rollback is just a deploy to a previous version.

Random thought about this: I remember when hhvm was new there was talk about using https://github.com/facebook/hhvm/blob/a13d8a3981/hphp/runtime/server/takeover-agent.cpp to do zero downtime deploys. I have no idea if there exists a program similar to this for transferring file descriptors between php-fpm instances?

In T224857#5247276, @thcipriani wrote:

In T224857#5243133, @Joe wrote:

Regarding rollbacks: I guess that we can add a switch that, just for rollbacks, sends out the opcache reset first, and does the rolling restart afterwards, thus reducing the window of time in which the issues are present. We'll have a small window of time in which an opcache corruption would be possible, but only in case of emergency.

@thcipriani would that work for addressing your concerns?

Checking my understanding of what you're saying (please correct me where I'm misunderstanding): for rollback/emergencies, scap does opcache_reset, syncs the file, then calls the smart script to do a rolling restart (11 minutes -- 350 servers 2 seconds/server). By sending the opcache_reset first means that the change goes live immediately, and the restart is just for cache sanity.

We'd sync the files, reset the opcache (which would actually make the rollback effective) and then do the rolling restart afterwards.

If the above is correct: I thought opcache_reset was never safe?

It's not, but between a marginal chance of having corruption on one machine, and having widespread errors, I would assume the former is preferrable.

Other small problem with scap is that for MediaWiki there is no specific rollback command (and, unfortunately doesn't rollback automagically T225207: Enable scap to roll back broken changes to MediaWiki); rollback is just a deploy to a previous version.

What we'd need is to add a cli flag to scap, --force, that would actually do the opcache reset.

Random thought about this: I remember when hhvm was new there was talk about using https://github.com/facebook/hhvm/blob/a13d8a3981/hphp/runtime/server/takeover-agent.cpp to do zero downtime deploys. I have no idea if there exists a program similar to this for transferring file descriptors between php-fpm instances?

Sadly no :(

• LarsWirzenius subscribed.Jun 13 2019, 11:47 AM

In T224857#5253300, @Joe wrote:

In T224857#5247276, @thcipriani wrote:

Checking my understanding of what you're saying (please correct me where I'm misunderstanding): for rollback/emergencies, scap does opcache_reset, syncs the file, then calls the smart script to do a rolling restart (11 minutes -- 350 servers 2 seconds/server). By sending the opcache_reset first means that the change goes live immediately, and the restart is just for cache sanity.

We'd sync the files, reset the opcache (which would actually make the rollback effective) and then do the rolling restart afterwards.

Is the ~11 minutes accurate? Are there any changes that can be done from the pybal-side to speed that up?

Other small problem with scap is that for MediaWiki there is no specific rollback command (and, unfortunately doesn't rollback automagically T225207: Enable scap to roll back broken changes to MediaWiki); rollback is just a deploy to a previous version.

What we'd need is to add a cli flag to scap, --force, that would actually do the opcache reset.

Sure, or maybe prompt folks during scap sync-wikiversions so they don't have to remember a special flag for an emergency.

Are there any other paths forward beyond the depool/restart that may be able to keep deployments fast(ish)? Scap refactoring to avoid 11 minute file deploys seems preferable if it is at all possible.

I think I saw you post elsewhere this article about Etsy symlink swaps: https://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/ That is, syncing to two different directories and swapping a symlink between them -- keeping two versions of the code in cache. From the scap side, this seems like something possible (though not trivial) to implement and test. Is this workable from the apache side?

It seems like the appservers have enough space for two copies of the code. This approach also has some added benefits/solves some longstanding bugs.

In T224857#5235896, @Joe wrote:

The problem is multifold:

revalidation removes atomicity. While this seems not to be an issue for us in the few releases we've had since moving away from opcache resets, I wouldn't trust this not being an issue when we do backports in swat, in particular. Given the way we do train deploys, those are mostly harmless.

While we do have a unique file path for each wikiversion (i.e., /srv/mediawiki/php-1.34.0-wmf.8 vs /srv/mediawiki/php-1.34.0-wmf.9) we still do rely on syncing a single file in mostly the same way as SWAT (wikiversions.php) to point each wiki to that directory.

Quick fly-by note from Performance perspective, current expectation and target of how MediaWiki has been optimised, is that the APC corpus survives at least 3-5 days.

More regular restarts than that should be coincided with an audit from us to confirm whether that would be expected to cause issues (especially in the many areas we don't have good monitoring on), and figure out alternate strategies as needed (e.g. more double-tiering through Memc, but that comes at a latency and code complexity cost as well).

Overall backend latencies, CPU load, or APC hit-miss ratios are not (yet) considered good proxies for this.

In T224857#5258042, @Krinkle wrote:

Quick fly-by note from Performance perspective, current expectation and target of how MediaWiki has been optimised, is that the APC corpus survives at least 3-5 days.

More regular restarts than that should be coincided with an audit from us to confirm whether that would be expected to cause issues (especially in the many areas we don't have good monitoring on), and figure out alternate strategies as needed (e.g. more double-tiering through Memc, but that comes at a latency and code complexity cost as well).

Overall backend latencies, CPU load, or APC hit-miss ratios are not (yet) considered good proxies for this.

The idea that we can't restart an appserver more often that 3-5 days sounds pretty strange at first sight, given the amount of traffic we get. Where is this figure taken from? Why? I'd need hard data to find this claim credible.

It might come as a surprise, but HHVM is constantly restarted on the api cluster, and has been known for years to incur the occasional segfault, making this assumption false even in the current context.

But my most important point is - having performance critically depend on a service not being restarted for days is operatively unacceptable, won't work if we want to do any form of autoscaling of capacity in the future, and needs to be fixed in the code.

Also, before building such implicit expectation in a software, it would be appreciated if the matter is discussed with SRE. We were never aware of such limitations, and we never operated according to it.

In T224857#5257986, @thcipriani wrote:

I think I saw you post elsewhere this article about Etsy symlink swaps: https://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/ That is, syncing to two different directories and swapping a symlink between them -- keeping two versions of the code in cache. From the scap side, this seems like something possible (though not trivial) to implement and test. Is this workable from the apache side?

It seems like the appservers have enough space for two copies of the code. This approach also has some added benefits/solves some longstanding bugs.

The etsy solution is actually pretty bad, if you ask me. It relies on an apache module that forces you to use the prefork MPM, which is a performance bottleneck, and also quite antiquated.

I'd rather explore other similar avenues, but those are long/medium term solutions, and this would block the migration on the short term?

Anyways, next week I can work on doing some tests.

To summarize the situation a bit, we have 2 problems right now that need to be solved:

Deploys are not atomic
Opcache gets progressively exhausted by deploying code

Se can solve 1) using the symlink swapping and mod_realdoc or switching to nginx on the frontend, but the second problem can only be solved via regular resets or restarts of the opcache.

Given we don't trust the opcache resets at all at this point, the rolling restart of the daemons seem like the best option to solve both issues.

Change 518023 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::php: increase opcache on jobrunners

https://gerrit.wikimedia.org/r/518023

gerritbot added a project: Patch-For-Review.Jun 20 2019, 12:07 PM

• fsero moved this task from Incoming 🐫 to Production Errors 🚜 on the serviceops board.Jun 20 2019, 2:16 PM

Change 518023 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::php: increase opcache on jobrunners

https://gerrit.wikimedia.org/r/518023

Joe moved this task from Production Errors 🚜 to Doing 😎 on the serviceops board.Jun 21 2019, 7:57 AM

In T224857#5268558, @Joe wrote:

To summarize the situation a bit, we have 2 problems right now that need to be solved:

Deploys are not atomic

Opcache gets progressively exhausted by deploying code

Se can solve 1) using the symlink swapping and mod_realdoc or switching to nginx on the frontend, but the second problem can only be solved via regular resets or restarts of the opcache.

Given we don't trust the opcache resets at all at this point, the rolling restart of the daemons seem like the best option to solve both issues.

Should we be working to implement symlink swapping in scap? Sounds like it's currently a blocker for this task (although opcache exhaustion may be blocking both).

If so the key is that we need the realpath for each deployment needs to be unique, correct? Atomic deployments are a side effect of the unique real path?

greg added a project: Release-Engineering-Team-TODO.Jun 21 2019, 10:31 PM

In T224857#5274663, @thcipriani wrote:

Should we be working to implement symlink swapping in scap? Sounds like it's currently a blocker for this task (although opcache exhaustion may be blocking both).

If so the key is that we need the realpath for each deployment needs to be unique, correct? Atomic deployments are a side effect of the unique real path?

No, I don't think that's what we should focus on.

We want to solve both the atomicity problem and the opcache exhaustion problem, and the symlink swapping won't solve the latter, which is by far the more urgent matter.

As I said, rolling restarts on each deployment solve both problems, but for now given the doubts I've seen floating around we could find a middle ground as follows:

we only run the rolling restart when a signficant code path change happens, so when we run the train

this would unblock the php7 transition at the very least. Is that acceptable to you?

Change 518664 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] safe-service-restart: logging improvements

https://gerrit.wikimedia.org/r/518664

Change 518664 merged by Giuseppe Lavagetto:
[operations/puppet@production] safe-service-restart: logging improvements

https://gerrit.wikimedia.org/r/518664

In T224857#5277649, @Joe wrote:

In T224857#5274663, @thcipriani wrote:

Should we be working to implement symlink swapping in scap? Sounds like it's currently a blocker for this task (although opcache exhaustion may be blocking both).

If so the key is that we need the realpath for each deployment needs to be unique, correct? Atomic deployments are a side effect of the unique real path?

No, I don't think that's what we should focus on.

We want to solve both the atomicity problem and the opcache exhaustion problem, and the symlink swapping won't solve the latter, which is by far the more urgent matter.

As I said, rolling restarts on each deployment solve both problems, but for now given the doubts I've seen floating around we could find a middle ground as follows:

we only run the rolling restart when a signficant code path change happens, so when we run the train

this would unblock the php7 transition at the very least. Is that acceptable to you?

If the plan is to do a rolling restart when a significant code change happens then there's really only one sync per week where we make significant code changes -- the Tuesday sync; i.e., the sync that rolls out a new version to group0 wikis. Other syncs are changes to existing files, the addition of a small number of new files, or routing wikis to use new code rather than old code.

On Tuesdays servers will not run the new code during the initial scap sync, but at the end of sync-wikiversions when we rebuild wikiversions.php on each server.

Perhaps a rolling restart targeting only the Tuesday sync would be acceptable? A smart script that runs the nrpe check to determine if the opcache is almost full (if you think that would save any time) and does the other magic you've described (i.e., pybal fiddling, restart).

By doing the full restart only on the Tuesday sync, it would add some time to a long process (X minutes in addition to the 30 minute sync: not ideal, but not too terrible), but would allow rollback and SWAT to remain fast.

In scap this could be done by adding a flag to sync-wikiversions -- --[no]-cluster-restart if neither --cluster-restart or --no-cluster-restart is passed, then prompt Perform rolling cluster restart following wikiversions update? [Y/n] this should allow for script usage and provide instructions for folks not aware of the new flag's existence.

To the "deploys are not atomic" problem -- would solving that via symlink swaps exacerbate the opcache problem? That is, using a symlink swap would make every deployment a significant code change since every file (even when its unchanged) would have a unique realpath for every deployment.

akosiaris moved this task from Doing 😎 to this.quarter 🍕 on the serviceops board.Jun 24 2019, 3:33 PM

greg edited projects, added Release-Engineering-Team (Deployment services); removed Release-Engineering-Team.Jun 24 2019, 9:20 PM

Change 519207 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: add daemon restart cronjob

https://gerrit.wikimedia.org/r/519207

Change 519208 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki: run the cron for php restarts everywhere

https://gerrit.wikimedia.org/r/519208

We could kick off a rolling restart also at the end of each SWAT window, just for good measure. Of course, that goes against @Krinkle's desire for keeping app servers running for > 3 days. Still waiting to hear more from him about the performance implications of restarting ~daily.

Change 519207 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: add daemon restart cronjob

https://gerrit.wikimedia.org/r/519207

Change 519208 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: run the cron for php restarts everywhere

https://gerrit.wikimedia.org/r/519208

Note: we should think about how this affects canary tests in scap.

I had a meeting with @thcipriani yesterday and we came up with the following contingency plan:

We will keep opcache revalidation on for the time being
At every deploy, scap will run in batches (see below) /usr/local/sbin/check-and-restart-php php7.2-fpm 100; this will:
- check if the free opcache is above 100 MB, if so, it will just return immediately
- If not, it will depool/verify/restart service/pool/verify before returning with an appropriate exit code
This will mean that for an average deployment, no restart will happen and the execution will last a few seconds
Whenver a rolling restart is needed, that will happen only on the servers where it's needed.

How will the restart be performed:

Across three dsh groups: appservers, api, and jobrunners in parallel (pretty much what we did for opcache invalidation), across the two datacenters in parallel
In batches as large as at most 10% of the server pool
If any of the tasks fail in a batch, stop the action for that pool, report an error

The tentative agreement is the execution will be implemented in scap directly. Particular care will probably need to be given to canaries - we might want to restart them before we check logstash?
An alternative approach proposed - using a dedicated installation of cumin for running the commands - wouldn't allow very fine-grained interaction and is thus considered a last resort at this point.

Joe added a parent task: T176370: Migrate to PHP 7 in WMF production.Jun 27 2019, 9:47 AM

In T224857#5288470, @Joe wrote:

Whenver a rolling restart is needed, that will happen only on the servers where it's needed.

If that is so, are we going to remove the service restart timers we have now?

How will the restart be performed:

Across three dsh groups: appservers, api, and jobrunners in parallel (pretty much what we did for opcache invalidation), across the two datacenters in parallel

In batches as large as at most 10% of the server pool

If any of the tasks fail in a batch, stop the action for that pool, report an error

An alternative approach proposed - using a dedicated installation of cumin for running the commands - wouldn't allow very fine-grained interaction and is thus considered a last resort at this point.

I agree, let's try less complex solutions first :)

In T224857#5288802, @jijiki wrote:

In T224857#5288470, @Joe wrote:

Whenver a rolling restart is needed, that will happen only on the servers where it's needed.

If that is so, are we going to remove the service restart timers we have now?

I guess eventually we will. That's not the kind of solution we want to duct-tape and then depend upon long-term.

Agabi10 subscribed.Jun 27 2019, 1:27 PM

Jdforrester-WMF mentioned this in T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262).Jun 28 2019, 2:26 AM

Fito subscribed.Jun 28 2019, 12:39 PM

greg moved this task from Should be empty (use Release-Engineering-Team) to Next on the Release-Engineering-Team-TODO board.Jul 6 2019, 6:09 AM

@thcipriani do you need more information from us? Any idea when work on scap will begin?

In T224857#5352559, @Joe wrote:

@thcipriani do you need more information from us? Any idea when work on scap will begin?

I started working on this late last week (but forgot to update workboard tags). My goal is to have a patchset end of this week/early next week.

In T224857#5354823, @thcipriani wrote:

In T224857#5352559, @Joe wrote:

@thcipriani do you need more information from us? Any idea when work on scap will begin?

I started working on this late last week (but forgot to update workboard tags). My goal is to have a patchset end of this week/early next week.

That's great, thanks! Let me know if I can help.

greg assigned this task to thcipriani.Jul 23 2019, 11:54 AM

greg moved this task from INBOX to Doing on the Release-Engineering-Team-TODO (201907) board.

Change 525117 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php7x: Add php_fpm module

https://gerrit.wikimedia.org/r/525117

Change 525118 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php7x: restart on 'scap pull'

https://gerrit.wikimedia.org/r/525118

Change 525119 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php7x: restart php-fpm after all sync operations

https://gerrit.wikimedia.org/r/525119

greg edited projects, added Release-Engineering-Team-TODO (201908); removed Release-Engineering-Team-TODO (201907).Jul 29 2019, 4:18 PM

greg moved this task from INBOX to Doing on the Release-Engineering-Team-TODO (201908) board.

Change 525117 merged by jenkins-bot:
[mediawiki/tools/scap@master] php7x: Add php_fpm module

https://gerrit.wikimedia.org/r/525117

Change 525118 merged by jenkins-bot:
[mediawiki/tools/scap@master] php7x: restart on 'scap pull'

https://gerrit.wikimedia.org/r/525118

Change 525119 merged by jenkins-bot:
[mediawiki/tools/scap@master] php7x: restart php-fpm after all sync operations

https://gerrit.wikimedia.org/r/525119

Krinkle mentioned this in T230037: Create warmup procedure for MediaWiki app servers.Aug 7 2019, 3:57 PM

Dzahn closed subtask T230144: Deploy scap 3.12.0-1 to production as Resolved.Aug 8 2019, 10:09 PM

Change 530014 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] scap: set flag for check-and-restart-php

https://gerrit.wikimedia.org/r/530014

In T224857#5413024, @gerritbot wrote:

Change 530014 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] scap: set flag for check-and-restart-php

https://gerrit.wikimedia.org/r/530014

The scap work here is done, merged, and deployed. Still need to update the scap.cfg in puppet to use the new functionality. Specifically need to set: php_fpm_restart_script: /usr/local/sbin/check-and-restart-php, php_fpm: php7.2-fpm, and the default threshold is currently 100 so unless that should be different, it can remain unset for the moment. I took a crack at adding these items to the config in 530014

Change 531475 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::common: Allow mwdeploy user to restart php-fpm

https://gerrit.wikimedia.org/r/531475

Change 531475 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::common: Allow mwdeploy user to restart php-fpm

https://gerrit.wikimedia.org/r/531475

Change 530014 merged by Effie Mouzeli:
[operations/puppet@production] scap: set flag for check-and-restart-php

https://gerrit.wikimedia.org/r/530014

@thcipriani When a deployer run scap pull on mwdebug, they got sudo: [/usr/local/sbin/check-and-restart-php,: command not found (notice the extra comma), so we rolledback https://gerrit.wikimedia.org/r/531475

Change 532828 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php-fpm: sudo_check_call takes a string

https://gerrit.wikimedia.org/r/532828

Change 533782 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[mediawiki/tools/scap@master] Fix restart_php

https://gerrit.wikimedia.org/r/533782

thcipriani edited projects, added Release-Engineering-Team-TODO (201909); removed Release-Engineering-Team-TODO (201908).Sep 3 2019, 2:42 PM

thcipriani moved this task from INBOX to Doing on the Release-Engineering-Team-TODO (201909) board.Sep 3 2019, 2:44 PM

Change 532828 abandoned by Thcipriani:
php-fpm: sudo_check_call takes a string

Reason:
Abandon for Ib1f6ca866d423519036355358fef6382404254fe

https://gerrit.wikimedia.org/r/532828

Change 533782 merged by jenkins-bot:
[mediawiki/tools/scap@master] Fix restart_php

https://gerrit.wikimedia.org/r/533782

In T224857#5462415, @gerritbot wrote:

Change 533782 merged by jenkins-bot:
[mediawiki/tools/scap@master] Fix restart_php

https://gerrit.wikimedia.org/r/533782

Change is live in beta now

@thcipriani should we create a new package/release?

In T224857#5463992, @Joe wrote:

@thcipriani should we create a new package/release?

yes!

The scap repo has tags for [[ https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/debian/3.12.1-1 | debian/3.12.1-1 ]] on the release branch.

Instructions for updating are on Wikitech.

Here is the full changelog for this release:

debian/changelog

scap (3.12.1-1) unstable; urgency=low
  [ Lucas Werkmeister ]
  * Remove l10n-update from manpage
  [ Giuseppe Lavagetto ]
  * Fix restart_php (Bug: T224857)
 -- Tyler Cipriani <tcipriani@wikimedia.org>  Wed, 04 Sep 2019 14:31:47 -0600

I did some tests, and we still have one problem with scap pull:

It is run as a common user (e.g. foo)
It runs commands via the mwdeploy user, that sudos to root to run the check-restart command. However, due to how conftool works, credentials get searched in the home of the user launching the request

So it fails to depool/restart/pool the servers because of insufficient credentials being available.

OTOH, @hashar and I tested scap sync-wikiversions and it correctly restarted remotely the application servers.

So for now I'm going to readd the directives to restart php7.2 only on the deployment hosts to scap.cfg, and we will have time to find a proper solution.

Change 534584 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] scap: restart php-fpm if needed when doing a full deploy

https://gerrit.wikimedia.org/r/534584

Mentioned in SAL (#wikimedia-operations) [2019-09-05T10:28:04Z] <_joe_> upgrading scap across the fleet T224857

Change 534584 merged by Giuseppe Lavagetto:
[operations/puppet@production] scap: restart php-fpm if needed when doing a full deploy

https://gerrit.wikimedia.org/r/534584

This is now 99% done. We just need a confctl release to be able to make scap pull work as intended. Resolving this task though, as the work on the mw deployment side has been completed.

Jdforrester-WMF moved this task from Doing to Completed on the Release-Engineering-Team-TODO (201909) board.Sep 20 2019, 8:43 PM

thcipriani mentioned this in T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers.Mar 3 2020, 1:07 AM

Joe mentioned this in T313770: scap no longer restarts php-fpm on canary servers.Jul 26 2022, 12:33 PM

	Joe
	Jun 3 2019, 7:05 AM

Enhance MediaWiki deployments for support of php7.xClosed, ResolvedPublicActions

Description

Details

Related ObjectsSearch...

Event Timeline

Enhance MediaWiki deployments for support of php7.x
Closed, ResolvedPublic
Actions

Related Objects
Search...