Enhance MediaWiki deployments for support of php7.x
Closed, ResolvedPublic

Description

Our current deployment strategy for MediaWiki's code is as follows:

  • rsync files to servers
  • wait for the php interpreter to pick up the changes on the filesystem

This didn't work very well with HHVM either, but it worked tolerably, so we never invested the time and resources to improve it. With php7, things are even worse:

  • PHP7 stores compiled opcodes in the opcache, which keys entries by filesystem path, not by inode as APC did
  • opcache doesn't automatically detect changes on the filesystem, except at a significant performance cost
  • opcache can be reset upon release, but that causes all sorts of instabilities and dangerous inconsistencies (see T224491)
  • opcache can revalidate the contents of files at regular intervals (which is what we're doing now; see the directives sketched below)
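For context, the behaviours above map onto a handful of standard opcache INI directives; the values below are placeholders for illustration, not our production settings.

php.ini (illustrative values)
; total opcache size in MB; once exhausted, new or changed files can no longer be cached
opcache.memory_consumption=512
; 1 = stat files on disk and recompile changed ones (revalidation), 0 = never recheck after first compile
opcache.validate_timestamps=1
; with revalidation on, each file is rechecked at most once per this many seconds
opcache.revalidate_freq=60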

Problems with our current approach are:

  • No atomicity on release: inconsistencies between different scripts can persist for as long as the whole revalidation interval, since revalidation happens per file, not globally
  • Opcache can fill up, which causes a huge performance hit.

Given the impossibility of reliably resetting the opcache, we're left with only one option: perform a rolling restart of the php-fpm daemons upon every release (except sync-file).

I see a few ways to go with this in the short term:

  • We implement the php/hhvm rolling restart cookbook for spicerack (dependent on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/503947 being merged), and we set up scap to run it after all the code has been distributed
  • We implement a smart script on the appservers themselves, called by scap, that will depool, restart, and repool the appserver (a rough sketch of such a script follows below). We can make appservers wait in a queue for their turn by using something like poolcounter.
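For illustration, a rough Python sketch of what such a per-server script could look like; the confctl selector, health-check URL and service name are assumptions here, and the real implementation is the safe-service-restart script referenced further down in this task.

depool_restart_repool.py (sketch only)
#!/usr/bin/env python3
"""Illustrative depool -> restart -> repool wrapper for a single appserver."""
import socket
import subprocess

SERVICE = "php7.2-fpm"  # assumed service name
HOST = socket.getfqdn()

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def main():
    # Drain this server from the load balancer via conftool/pybal.
    run(["confctl", "select", "name={}".format(HOST), "set/pooled=no"])
    # Restart the FPM daemon; this also throws away the stale opcache.
    run(["systemctl", "restart", SERVICE])
    # Minimal local health check before repooling (URL is a placeholder).
    run(["curl", "-sf", "-o", "/dev/null", "http://localhost/healthz"])
    # Put the server back in rotation; a real script would also handle
    # failures above and decide whether it is safe to repool at all.
    run(["confctl", "select", "name={}".format(HOST), "set/pooled=yes"])

if __name__ == "__main__":
    main()

Coordinating so that only a few servers do this at any given time (e.g. via poolcounter, as suggested above) would live outside the script itself.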

In any case, it could be advisable to change the way we do SWATs: we might just merge change after change, test them on mwdebug, then deploy cluster-wide one single time.

To be clear, in my mind this is not a long-term solution; for that we'll have to radically rethink how we do code deploys.

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 514660 merged by Giuseppe Lavagetto:
[operations/puppet@production] conftool::scripts: add a safe-service-restart script

https://gerrit.wikimedia.org/r/514660

Change 514695 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki: increase opcache space on canaries

https://gerrit.wikimedia.org/r/514695

Change 514695 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: increase opcache space on canaries

https://gerrit.wikimedia.org/r/514695

Change 514706 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: increase opcache everywhere

https://gerrit.wikimedia.org/r/514706

Change 514706 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: increase opcache everywhere

https://gerrit.wikimedia.org/r/514706

Change 514799 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php::monitoring: add checks for opcache status

https://gerrit.wikimedia.org/r/514799

Change 514799 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php::monitoring: add checks for opcache status

https://gerrit.wikimedia.org/r/514799

A small comment about what has been done and what needs to be done:

  • We upped opcache to 1 GB; by my estimates, that should last about one week of deployments before being exhausted, maybe less
  • We added a check for opcache health
  • We created a script for safely restarting php-fpm that could also be used during deployments

and finally

  • We're almost ready to deploy a periodic job that will automatically restart php-fpm when opcache space runs low

This will allow us to survive a little longer, maybe even long enough to unblock the migration, but this ticket /needs/ a more permanent solution, which means rethinking how we distribute code in production. Specifically, I think the low-hanging fruit is to do rolling restarts upon each deployment: we can solve the pybal slowness on one side, and create a cumin script that scap can launch after every sync (and be done in a couple of minutes).

Regarding rollbacks: I guess that we can add a switch that, just for rollbacks, sends out the opcache reset first, and does the rolling restart afterwards, thus reducing the window of time in which the issues are present. We'll have a small window of time in which an opcache corruption would be possible, but only in case of emergency.

@thcipriani would that work for addressing your concerns?

Regarding rollbacks: I guess that we can add a switch that, just for rollbacks, sends out the opcache reset first, and does the rolling restart afterwards, thus reducing the window of time in which the issues are present. We'll have a small window of time in which an opcache corruption would be possible, but only in case of emergency.

@thcipriani would that work for addressing your concerns?

Checking my understanding of what you're saying (please correct me where I'm misunderstanding): for rollback/emergencies, scap does opcache_reset, syncs the file, then calls the smart script to do a rolling restart (~11 minutes: 350 servers at 2 seconds/server). Sending the opcache_reset first means that the change goes live immediately, and the restart is just for cache sanity.

If the above is correct: I thought opcache_reset was never safe?

Another small problem with scap is that for MediaWiki there is no specific rollback command (and, unfortunately, it doesn't roll back automagically; see T225207: Enable scap to roll back broken changes to MediaWiki); a rollback is just a deploy of a previous version.

Random thought about this: I remember when hhvm was new there was talk about using https://github.com/facebook/hhvm/blob/a13d8a3981/hphp/runtime/server/takeover-agent.cpp to do zero downtime deploys. I have no idea if there exists a program similar to this for transferring file descriptors between php-fpm instances?

Regarding rollbacks: I guess that we can add a switch that, just for rollbacks, sends out the opcache reset first, and does the rolling restart afterwards, thus reducing the window of time in which the issues are present. We'll have a small window of time in which an opcache corruption would be possible, but only in case of emergency.

@thcipriani would that work for addressing your concerns?

Checking my understanding of what you're saying (please correct me where I'm misunderstanding): for rollback/emergencies, scap does opcache_reset, syncs the file, then calls the smart script to do a rolling restart (~11 minutes: 350 servers at 2 seconds/server). Sending the opcache_reset first means that the change goes live immediately, and the restart is just for cache sanity.

We'd sync the files, reset the opcache (which would actually make the rollback effective) and then do the rolling restart afterwards.

If the above is correct: I thought opcache_reset was never safe?

It's not, but between a marginal chance of having corruption on one machine and having widespread errors, I would assume the former is preferable.

Another small problem with scap is that for MediaWiki there is no specific rollback command (and, unfortunately, it doesn't roll back automagically; see T225207: Enable scap to roll back broken changes to MediaWiki); a rollback is just a deploy of a previous version.

What we'd need is to add a cli flag to scap, --force, that would actually do the opcache reset.

Random thought about this: I remember when hhvm was new there was talk about using https://github.com/facebook/hhvm/blob/a13d8a3981/hphp/runtime/server/takeover-agent.cpp to do zero downtime deploys. I have no idea if there exists a program similar to this for transferring file descriptors between php-fpm instances?

Sadly no :(

Checking my understanding of what you're saying (please correct me where I'm misunderstanding): for rollback/emergencies, scap does opcache_reset, syncs the file, then calls the smart script to do a rolling restart (~11 minutes: 350 servers at 2 seconds/server). Sending the opcache_reset first means that the change goes live immediately, and the restart is just for cache sanity.

We'd sync the files, reset the opcache (which would actually make the rollback effective) and then do the rolling restart afterwards.

Is the ~11 minutes accurate? Are there any changes that can be made on the pybal side to speed that up?

Another small problem with scap is that for MediaWiki there is no specific rollback command (and, unfortunately, it doesn't roll back automagically; see T225207: Enable scap to roll back broken changes to MediaWiki); a rollback is just a deploy of a previous version.

What we'd need is to add a cli flag to scap, --force, that would actually do the opcache reset.

Sure, or maybe prompt folks during scap sync-wikiversions so they don't have to remember a special flag for an emergency.

Are there any other paths forward beyond the depool/restart that may be able to keep deployments fast(ish)? Scap refactoring to avoid 11 minute file deploys seems preferable if it is at all possible.

I think I saw you post elsewhere this article about Etsy symlink swaps: https://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/ That is, syncing to two different directories and swapping a symlink between them -- keeping two versions of the code in cache. From the scap side, this seems like something possible (though not trivial) to implement and test. Is this workable from the apache side?

It seems like the appservers have enough space for two copies of the code. This approach also has some added benefits/solves some longstanding bugs.
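For what it's worth, the core of the Etsy-style swap is small. A minimal sketch, assuming a hypothetical /srv/mediawiki/current symlink that the web server's docroot points at (paths here are illustrative):

atomic_swap.py (sketch only)
import os

def atomic_symlink_swap(target_dir, link_path):
    """Repoint link_path at target_dir atomically: build a new symlink next to
    the old one and rename() it into place, so requests always see either the
    old tree or the new tree, never a half-synced mix."""
    tmp = link_path + ".new"
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(target_dir, tmp)
    os.replace(tmp, link_path)  # rename(2) is atomic on the same filesystem

# e.g. after rsyncing the new code into a fresh directory:
# atomic_symlink_swap("/srv/mediawiki/releases/2019-06-11", "/srv/mediawiki/current")

The opcache side still needs something like mod_realdoc (as in the Etsy post) so that the resolved realpath, rather than the symlink, is what each cached file is keyed on.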

The problem is multifold:

  • revalidation removes atomicity. While this doesn't seem to have been an issue for us in the few releases we've done since moving away from opcache resets, I wouldn't trust it not to become an issue when we do backports in SWAT, in particular. Given the way we do train deploys, those are mostly harmless.

While we do have a unique file path for each wikiversion (i.e., /srv/mediawiki/php-1.34.0-wmf.8 vs. /srv/mediawiki/php-1.34.0-wmf.9), we still rely on syncing a single file (wikiversions.php), in mostly the same way as SWAT, to point each wiki to that directory.

Quick fly-by note from the Performance perspective: the current expectation and target, given how MediaWiki has been optimised, is that the APC corpus survives at least 3-5 days.

More frequent restarts than that should coincide with an audit from us to confirm whether they would be expected to cause issues (especially in the many areas we don't have good monitoring on), and to figure out alternate strategies as needed (e.g. more double-tiering through Memc, but that comes at a latency and code-complexity cost as well).

Overall backend latencies, CPU load, or APC hit-miss ratios are not (yet) considered good proxies for this.

Quick fly-by note from the Performance perspective: the current expectation and target, given how MediaWiki has been optimised, is that the APC corpus survives at least 3-5 days.

More frequent restarts than that should coincide with an audit from us to confirm whether they would be expected to cause issues (especially in the many areas we don't have good monitoring on), and to figure out alternate strategies as needed (e.g. more double-tiering through Memc, but that comes at a latency and code-complexity cost as well).

Overall backend latencies, CPU load, or APC hit-miss ratios are not (yet) considered good proxies for this.

The idea that we can't restart an appserver more often than every 3-5 days sounds pretty strange at first sight, given the amount of traffic we get. Where does this figure come from? Why? I'd need hard data to find this claim credible.

It might come as a surprise, but HHVM is constantly restarted on the api cluster, and has been known for years to incur the occasional segfault, making this assumption false even in the current context.

But my most important point is this: having performance critically depend on a service not being restarted for days is operationally unacceptable, won't work if we want to do any form of capacity autoscaling in the future, and needs to be fixed in the code.

Also, before building such an implicit expectation into the software, it would be appreciated if the matter were discussed with SRE. We were never aware of such a limitation, and we never operated according to it.

I think I saw you post elsewhere this article about Etsy symlink swaps: https://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/ That is, syncing to two different directories and swapping a symlink between them -- keeping two versions of the code in cache. From the scap side, this seems like something possible (though not trivial) to implement and test. Is this workable from the apache side?

It seems like the appservers have enough space for two copies of the code. This approach also has some added benefits/solves some longstanding bugs.

The Etsy solution is actually pretty bad, if you ask me. It relies on an Apache module that forces you to use the prefork MPM, which is a performance bottleneck and also quite antiquated.

I'd rather explore other, similar avenues, but those are medium/long-term solutions, and this would block the migration in the short term.

Anyways, next week I can work on doing some tests.

To summarize the situation a bit, we have 2 problems right now that need to be solved:

  1. Deploys are not atomic
  2. Opcache gets progressively exhausted by deploying code

We can solve 1) using symlink swapping and mod_realdoc, or by switching to nginx on the frontend, but the second problem can only be solved via regular resets or restarts of the opcache.

Given that we don't trust opcache resets at all at this point, a rolling restart of the daemons seems like the best option to solve both issues.

Change 518023 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::php: increase opcache on jobrunners

https://gerrit.wikimedia.org/r/518023

Change 518023 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::php: increase opcache on jobrunners

https://gerrit.wikimedia.org/r/518023

To summarize the situation a bit, we have 2 problems right now that need to be solved:

  1. Deploys are not atomic
  2. Opcache gets progressively exhausted by deploying code

We can solve 1) using symlink swapping and mod_realdoc, or by switching to nginx on the frontend, but the second problem can only be solved via regular resets or restarts of the opcache.

Given that we don't trust opcache resets at all at this point, a rolling restart of the daemons seems like the best option to solve both issues.

Should we be working to implement symlink swapping in scap? It sounds like it's currently a blocker for this task (although opcache exhaustion may be blocking both).

If so, the key is that the realpath for each deployment needs to be unique, correct? Atomic deployments are a side effect of the unique realpath?

Should we be working to implement symlink swapping in scap? It sounds like it's currently a blocker for this task (although opcache exhaustion may be blocking both).

If so, the key is that the realpath for each deployment needs to be unique, correct? Atomic deployments are a side effect of the unique realpath?

No, I don't think that's what we should focus on.

We want to solve both the atomicity problem and the opcache exhaustion problem, and the symlink swapping won't solve the latter, which is by far the more urgent matter.

As I said, rolling restarts on each deployment solve both problems, but for now, given the doubts I've seen floating around, we could find a middle ground as follows:

  • we only run the rolling restart when a significant code path change happens, so when we run the train

This would unblock the php7 transition at the very least. Is that acceptable to you?

Change 518664 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] safe-service-restart: logging improvements

https://gerrit.wikimedia.org/r/518664

Change 518664 merged by Giuseppe Lavagetto:
[operations/puppet@production] safe-service-restart: logging improvements

https://gerrit.wikimedia.org/r/518664

Should we be working to implement symlink swapping in scap? It sounds like it's currently a blocker for this task (although opcache exhaustion may be blocking both).

If so, the key is that the realpath for each deployment needs to be unique, correct? Atomic deployments are a side effect of the unique realpath?

No, I don't think that's what we should focus on.

We want to solve both the atomicity problem and the opcache exhaustion problem, and the symlink swapping won't solve the latter, which is by far the more urgent matter.

As I said, rolling restarts on each deployment solve both problems, but for now given the doubts I've seen floating around we could find a middle ground as follows:

  • we only run the rolling restart when a significant code path change happens, so when we run the train

this would unblock the php7 transition at the very least. Is that acceptable to you?

If the plan is to do a rolling restart when a significant code change happens then there's really only one sync per week where we make significant code changes -- the Tuesday sync; i.e., the sync that rolls out a new version to group0 wikis. Other syncs are changes to existing files, the addition of a small number of new files, or routing wikis to use new code rather than old code.

On Tuesdays servers will not run the new code during the initial scap sync, but at the end of sync-wikiversions when we rebuild wikiversions.php on each server.

Perhaps a rolling restart targeting only the Tuesday sync would be acceptable? A smart script that runs the nrpe check to determine if the opcache is almost full (if you think that would save any time) and does the other magic you've described (i.e., pybal fiddling, restart).

By doing the full restart only on the Tuesday sync, it would add some time to a long process (X minutes in addition to the 30 minute sync: not ideal, but not too terrible), but would allow rollback and SWAT to remain fast.

In scap this could be done by adding a flag to sync-wikiversions, --[no]-cluster-restart: if neither --cluster-restart nor --no-cluster-restart is passed, then prompt "Perform rolling cluster restart following wikiversions update? [Y/n]". This should allow for scripted usage and provide instructions for folks not aware of the new flag's existence.
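A minimal sketch of that flag/prompt logic, using plain argparse for illustration (the flag is the proposal above, not something scap has today, and scap's actual option plumbing differs):

cluster_restart_flag.py (sketch only)
import argparse

parser = argparse.ArgumentParser(prog="scap sync-wikiversions")
group = parser.add_mutually_exclusive_group()
group.add_argument("--cluster-restart", dest="cluster_restart",
                   action="store_true", default=None,
                   help="perform a rolling php-fpm restart after the sync")
group.add_argument("--no-cluster-restart", dest="cluster_restart",
                   action="store_false", default=None,
                   help="skip the rolling restart")
args = parser.parse_args()

if args.cluster_restart is None:
    # Neither flag was passed: ask interactively, defaulting to yes.
    answer = input("Perform rolling cluster restart following wikiversions update? [Y/n] ")
    args.cluster_restart = answer.strip().lower() not in ("n", "no")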

To the "deploys are not atomic" problem -- would solving that via symlink swaps exacerbate the opcache problem? That is, using a symlink swap would make every deployment a significant code change since every file (even when its unchanged) would have a unique realpath for every deployment.

Change 519207 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki::php: add daemon restart cronjob

https://gerrit.wikimedia.org/r/519207

Change 519208 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mediawiki: run the cron for php restarts everywhere

https://gerrit.wikimedia.org/r/519208

We could also kick off a rolling restart at the end of each SWAT window, just for good measure. Of course, that goes against @Krinkle's desire to keep app servers running for > 3 days. Still waiting to hear more from him about the performance implications of restarting ~daily.

Change 519207 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki::php: add daemon restart cronjob

https://gerrit.wikimedia.org/r/519207

Change 519208 merged by Giuseppe Lavagetto:
[operations/puppet@production] mediawiki: run the cron for php restarts everywhere

https://gerrit.wikimedia.org/r/519208

Note: we should think about how this affects canary tests in scap.

I had a meeting with @thcipriani yesterday and we came up with the following contingency plan:

  • We will keep opcache revalidation on for the time being
  • At every deploy, scap will run in batches (see below) /usr/local/sbin/check-and-restart-php php7.2-fpm 100; this will:
    • check if the free opcache is above 100 MB, if so, it will just return immediately
    • If not, it will depool/verify/restart service/pool/verify before returning with an appropriate exit code
  • This will mean that for an average deployment, no restart will happen and the execution will last a few seconds
  • Whenever a rolling restart is needed, it will happen only on the servers where it's needed.
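For illustration, the per-server check could look roughly like the sketch below. The opcache-status endpoint, its port, and the safe-restart helper path are assumptions made for the sketch, not the actual production interfaces; the real script is check-and-restart-php from puppet.

check_and_restart.py (sketch only)
#!/usr/bin/env python3
"""Sketch of the check-and-restart flow described in the plan above."""
import json
import subprocess
import sys
import urllib.request

THRESHOLD_MB = 100  # restart only when free opcache drops below this

def free_opcache_mb():
    # Hypothetical local endpoint returning opcache_get_status() as JSON.
    with urllib.request.urlopen("http://localhost:9181/opcache-info") as resp:
        status = json.load(resp)
    return status["memory_usage"]["free_memory"] / (1024 * 1024)

def main():
    if free_opcache_mb() >= THRESHOLD_MB:
        # Plenty of space left: nothing to do, return immediately.
        return 0
    # Otherwise depool, restart php-fpm and repool via the safe wrapper
    # (path assumed; see the safe-service-restart change earlier in this task).
    return subprocess.call(["/usr/local/sbin/safe-service-restart", "php7.2-fpm"])

if __name__ == "__main__":
    sys.exit(main())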

How will the restart be performed:

  • Across three dsh groups: appservers, api, and jobrunners in parallel (pretty much what we did for opcache invalidation), across the two datacenters in parallel
  • In batches as large as at most 10% of the server pool
  • If any of the tasks fail in a batch, stop the action for that pool, report an error
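A sketch of those batching rules (pool names taken from the list above; the raw ssh invocation stands in for scap's own ssh machinery):

rolling_restart.py (sketch only)
import math
import subprocess
from concurrent.futures import ThreadPoolExecutor

RESTART_CMD = "sudo /usr/local/sbin/check-and-restart-php php7.2-fpm 100"

def restart_host(host):
    subprocess.run(["ssh", host, RESTART_CMD], check=True)

def rolling_restart(hosts):
    """Restart one pool in batches of at most 10% of its servers; stop the
    whole pool and report an error as soon as any batch has a failure."""
    batch_size = max(1, math.ceil(len(hosts) * 0.10))
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        try:
            with ThreadPoolExecutor(max_workers=len(batch)) as ex:
                for future in [ex.submit(restart_host, h) for h in batch]:
                    future.result()
        except subprocess.CalledProcessError as err:
            print("stopping restarts for this pool:", err)
            return False
    return True

# The three dsh groups (and the two datacenters) run in parallel with one another.
pools = {"appservers": [], "api": [], "jobrunners": []}  # filled from the dsh group files
with ThreadPoolExecutor(max_workers=len(pools)) as ex:
    results = {name: ex.submit(rolling_restart, hosts) for name, hosts in pools.items()}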

The tentative agreement is that the execution will be implemented in scap directly. Particular care will probably need to be given to canaries - we might want to restart them before we check logstash?
An alternative approach that was proposed - using a dedicated installation of cumin to run the commands - wouldn't allow very fine-grained interaction and is thus considered a last resort at this point.

  • Whenever a rolling restart is needed, it will happen only on the servers where it's needed.

If that is so, are we going to remove the service restart timers we have now?

How will the restart be performed:

  • Across three dsh groups: appservers, api, and jobrunners in parallel (pretty much what we did for opcache invalidation), across the two datacenters in parallel
  • In batches as large as at most 10% of the server pool
  • If any of the tasks fail in a batch, stop the action for that pool, report an error

+1

An alternative approach proposed - using a dedicated installation of cumin for running the commands - wouldn't allow very fine-grained interaction and is thus considered a last resort at this point.

I agree, let's try less complex solutions first :)

  • Whenever a rolling restart is needed, it will happen only on the servers where it's needed.

If that is so, are we going to remove the service restart timers we have now?

I guess eventually we will. That's not the kind of solution we want to duct-tape and then depend upon long-term.

@thcipriani do you need more information from us? Any idea when work on scap will begin?

thcipriani moved this task from Needs triage to Services improvements on the Scap board.

@thcipriani do you need more information from us? Any idea when work on scap will begin?

I started working on this late last week (but forgot to update workboard tags). My goal is to have a patchset end of this week/early next week.

@thcipriani do you need more information from us? Any idea when work on scap will begin?

I started working on this late last week (but forgot to update workboard tags). My goal is to have a patchset end of this week/early next week.

That's great, thanks! Let me know if I can help.

Change 525117 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php7x: Add php_fpm module

https://gerrit.wikimedia.org/r/525117

Change 525118 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php7x: restart on 'scap pull'

https://gerrit.wikimedia.org/r/525118

Change 525119 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php7x: restart php-fpm after all sync operations

https://gerrit.wikimedia.org/r/525119

Change 525117 merged by jenkins-bot:
[mediawiki/tools/scap@master] php7x: Add php_fpm module

https://gerrit.wikimedia.org/r/525117

Change 525118 merged by jenkins-bot:
[mediawiki/tools/scap@master] php7x: restart on 'scap pull'

https://gerrit.wikimedia.org/r/525118

Change 525119 merged by jenkins-bot:
[mediawiki/tools/scap@master] php7x: restart php-fpm after all sync operations

https://gerrit.wikimedia.org/r/525119

Change 530014 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] scap: set flag for check-and-restart-php

https://gerrit.wikimedia.org/r/530014

The scap work here is done, merged, and deployed. We still need to update scap.cfg in puppet to use the new functionality. Specifically, we need to set php_fpm_restart_script: /usr/local/sbin/check-and-restart-php and php_fpm: php7.2-fpm; the default threshold is currently 100, so unless that should be different, it can remain unset for the moment. I took a crack at adding these items to the config in 530014.
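For reference, the puppet side boils down to something like the following scap.cfg entries (an illustrative excerpt; the exact section layout may differ, and the threshold key is left at its default of 100 as noted above).

scap.cfg (illustrative excerpt)
php_fpm: php7.2-fpm
php_fpm_restart_script: /usr/local/sbin/check-and-restart-php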

Change 531475 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] mediawiki::common: Allow mwdeploy user to restart php-fpm

https://gerrit.wikimedia.org/r/531475

Change 531475 merged by Effie Mouzeli:
[operations/puppet@production] mediawiki::common: Allow mwdeploy user to restart php-fpm

https://gerrit.wikimedia.org/r/531475

Change 530014 merged by Effie Mouzeli:
[operations/puppet@production] scap: set flag for check-and-restart-php

https://gerrit.wikimedia.org/r/530014

@thcipriani When a deployer ran scap pull on mwdebug, they got sudo: [/usr/local/sbin/check-and-restart-php,: command not found (note the extra comma), so we rolled back https://gerrit.wikimedia.org/r/531475
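For the record, the stray comma is what you get when a Python list is interpolated into the command string instead of a plain string. A minimal reproduction follows; the function here only mimics the relevant behaviour of scap's sudo_check_call, whose real signature and body differ.

sudo_comma_bug.py (sketch only)
def sudo_check_call(user, cmd):
    # scap builds a single shell command line; if cmd arrives as a list,
    # the list's string form (brackets and commas included) is what gets run.
    return "sudo -u {user} -- {cmd}".format(user=user, cmd=cmd)

print(sudo_check_call("mwdeploy",
                      ["/usr/local/sbin/check-and-restart-php", "php7.2-fpm", "100"]))
# sudo -u mwdeploy -- ['/usr/local/sbin/check-and-restart-php', 'php7.2-fpm', '100']
# The shell then splits that on whitespace, which is roughly how the
# "[/usr/local/sbin/check-and-restart-php,: command not found" error comes about.

print(sudo_check_call("mwdeploy",
                      "/usr/local/sbin/check-and-restart-php php7.2-fpm 100"))
# sudo -u mwdeploy -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100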

Change 532828 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[mediawiki/tools/scap@master] php-fpm: sudo_check_call takes a string

https://gerrit.wikimedia.org/r/532828

Change 533782 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[mediawiki/tools/scap@master] Fix restart_php

https://gerrit.wikimedia.org/r/533782

Change 532828 abandoned by Thcipriani:
php-fpm: sudo_check_call takes a string

Reason:
Abandon for Ib1f6ca866d423519036355358fef6382404254fe

https://gerrit.wikimedia.org/r/532828

Change 533782 merged by jenkins-bot:
[mediawiki/tools/scap@master] Fix restart_php

https://gerrit.wikimedia.org/r/533782

Change is live in beta now

@thcipriani should we create a new package/release?

@thcipriani should we create a new package/release?

yes!

The scap repo has tags for debian/3.12.1-1 (https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/debian/3.12.1-1) on the release branch.

Instructions for updating are on Wikitech.

Here is the full changelog for this release:

debian/changelog
scap (3.12.1-1) unstable; urgency=low
  [ Lucas Werkmeister ]
  * Remove l10n-update from manpage
  [ Giuseppe Lavagetto ]
  * Fix restart_php (Bug: T224857)
 -- Tyler Cipriani <tcipriani@wikimedia.org>  Wed, 04 Sep 2019 14:31:47 -0600

I did some tests, and we still have one problem with scap pull:

  • It is run as a regular user (e.g. foo)
  • It runs commands as the mwdeploy user, which sudos to root to run the check-and-restart command. However, due to how conftool works, credentials are looked up in the home directory of the user who launched the request

So it fails to depool/restart/repool the servers because the necessary credentials aren't available.

OTOH, @hashar and I tested scap sync-wikiversions and it correctly restarted the application servers remotely.

So for now I'm going to re-add the directives to restart php7.2 only on the deployment hosts to scap.cfg, which will give us time to find a proper solution.

Change 534584 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] scap: restart php-fpm if needed when doing a full deploy

https://gerrit.wikimedia.org/r/534584

Mentioned in SAL (#wikimedia-operations) [2019-09-05T10:28:04Z] <_joe_> upgrading scap across the fleet T224857

Change 534584 merged by Giuseppe Lavagetto:
[operations/puppet@production] scap: restart php-fpm if needed when doing a full deploy

https://gerrit.wikimedia.org/r/534584

Joe moved this task from this.quarter 🍕 to Doing 😎 on the serviceops board.

This is now 99% done; we just need a confctl release to make scap pull work as intended. Resolving this task anyway, as the work on the MediaWiki deployment side has been completed.