Our current deployment strategy for MediaWiki's code is as follows:
- rsync files to servers
- wait for the php interpreter to pick up the changes on the filesystem
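As a rough illustration of how little there is to this model (hostnames and paths below are made up, and in production scap drives this step for us rather than a hand-run script):

```
#!/usr/bin/env python3
"""Rough illustration of the current push-and-wait deployment model.

Hostnames and paths are hypothetical; in production scap performs this
step, this sketch only shows the shape of it.
"""
import subprocess

# Hypothetical list of target appservers and paths.
TARGETS = ["mw1001.example.wmnet", "mw1002.example.wmnet"]
STAGING = "/srv/mediawiki-staging/"
DEST = "/srv/mediawiki/"

for host in TARGETS:
    # Step 1: copy the new code to the server.
    subprocess.run(
        ["rsync", "-a", "--delete", STAGING, f"{host}:{DEST}"],
        check=True,
    )

# Step 2: there is no step 2 -- we simply wait for the interpreter to
# notice the changed files on disk; nothing explicitly activates the code.
```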
This didn't work very well with HHVM either, but it worked passably, so we never invested the time and resources to improve it. With php7, things are even worse:
- PHP7 saves its opcodes in opcache, which is keyed by filesystem path, not by inode like APCu was
- opcache doesn't automatically detect changes on the filesystem, except at the cost of a huge performance penalty
- opcache can be reset upon release, but that causes all sorts of instabilities and dangerous inconsistencies (see T224491)
- opcache can revalidate files at regular intervals (opcache.validate_timestamps together with opcache.revalidate_freq), which is what we're doing now
Problems with our current approach are:
- No atomicity on release: since revalidation happens per file rather than globally, different scripts can remain inconsistent with one another for up to the full revalidation interval
- opcache can fill up, which can cause a huge performance hit
Given the impossibility of reliably resetting the opcache, we're left with only one option: perform a rolling restart of the php-fpm daemons upon every release (except sync-file).
I see a few ways to go with this in the short term:
- We implement the php/hhvm rolling-restart cookbook for spicerack (dependent on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/503947 being merged) and set up scap to run it after all the code has been distributed (a rough sketch of such a cookbook follows this list)
- We implement a smart script on the appservers themselves that scap can call to depool, restart and repool the appserver. We can make appservers wait in queue for their turn by using something like poolcounter (see the second sketch below).
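For the first option, here is a minimal sketch of what such a cookbook could look like. It assumes the module-level cookbook interface (argument_parser() / run(args, spicerack)), spicerack's remote query/run_sync calls, and the conftool `depool` / `pool` wrapper scripts on the hosts; the query, service name and timings are illustrative, not the final cookbook:

```
"""Rolling restart of php-fpm across a cluster of appservers (sketch only).

Assumes spicerack's remote API (query/run_sync) and the conftool
`depool` / `pool` wrapper scripts on the hosts; all names are illustrative.
"""
import argparse
import time

__title__ = "Rolling restart of php-fpm"


def argument_parser():
    parser = argparse.ArgumentParser(description=__title__)
    parser.add_argument("query", help="Cumin query selecting the appservers")
    parser.add_argument("--grace", type=int, default=30,
                        help="seconds to wait between hosts")
    return parser


def run(args, spicerack):
    """Depool, restart php-fpm and repool one appserver at a time."""
    remote = spicerack.remote()
    targets = remote.query(args.query)

    for host in targets.hosts:
        one = remote.query(str(host))
        one.run_sync("depool")
        time.sleep(10)  # let in-flight requests drain
        one.run_sync("systemctl restart php7.2-fpm")  # service name illustrative
        one.run_sync("pool")
        time.sleep(args.grace)  # let opcache warm up before the next host

    return 0
```

For the second option, a sketch of the on-host script scap could invoke. The slot handling is a placeholder for a real poolcounter (or similar) client that limits how many appservers restart at once; command and service names are again assumptions:

```
#!/usr/bin/env python3
"""restart-php-fpm-safely: depool, restart php-fpm and repool this appserver.

Sketch of the "smart script" scap would call on each host; the lock below
is a stand-in for a poolcounter-style service limiting concurrent restarts.
"""
import contextlib
import subprocess
import time

PHP_FPM_SERVICE = "php7.2-fpm"   # illustrative service name
DRAIN_SECONDS = 10               # time to let in-flight requests finish
MAX_CONCURRENT = 2               # how many appservers may restart at once


@contextlib.contextmanager
def restart_slot(max_concurrent):
    """Placeholder: acquire a cluster-wide restart slot.

    A real implementation would block here, talking to poolcounter (or a
    similar service) so that at most `max_concurrent` appservers are
    depooled at any given time.
    """
    yield


def sh(*cmd):
    subprocess.run(cmd, check=True)


def main():
    with restart_slot(MAX_CONCURRENT):
        sh("depool")                    # conftool wrapper: remove from the pools
        time.sleep(DRAIN_SECONDS)       # let in-flight requests complete
        sh("systemctl", "restart", PHP_FPM_SERVICE)
        sh("pool")                      # put the server back in rotation


if __name__ == "__main__":
    main()
```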
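The second variant keeps all the restart logic on the host, so scap only has to invoke one command per server, but it moves the coordination problem into whatever service hands out the restart slots.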
In any case, it could be advisable to change the way we do SWATs: we might merge change after change, test each one on mwdebug, and then deploy cluster-wide a single time at the end.
To be clear, this is not a long-term solution in my mind; for that, we'll have to radically rethink how we do code deploys.