The outcome of T253673 is to go with idea 3: rolling restarts.
Work:
- Patch Scap to do the rolling restart always (instead of only every once in a while, currently based on unreliable opcache thresholds) -- https://gerrit.wikimedia.org/r/631776
- Patch Scap to add an emergency flag that performs this restart without preserving live server capacity, for cases where a bad patch has taken down a large part of the site. Tracked by T243009: Add option in Scap to restart php-fpm for emergency deployments, and skip depooling/pooling servers
- Deploy updated Scap to Beta Cluster -- https://integration.wikimedia.org/ci/job/scap-beta-deb/118/
- Package Scap for the production apt repository.
- Deploy updated Scap to production.
- Measure timing before and after, and report on task.
- Test results:
  - With php_fpm_always_restart: false, scap sync-file README takes 1m19.942s
  - With php_fpm_always_restart: true, scap sync-file README takes 3m12.836s
- Set opcache.validate_timestamps=0 on prod canaries.
* [ ] Set opcache.validate_timestamps=0 in Beta Cluster app servers that do php-fpm restarts.
- Set opcache.validate_timestamps=0 on all appserver clusters that do php-fpm restarts.
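
To put the two timings from the test results in perspective, here is a minimal sketch that converts them to seconds and reports the added time and slowdown factor. The helper name `parse_time` is hypothetical and only handles the `XmY.YYYs` format printed by the shell `time` builtin; the two durations are the ones reported above.

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time`-style duration such as '1m19.942s' to seconds.

    Hypothetical helper for illustration; only the 'XmY.YYYs' form is handled.
    """
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Durations reported for `scap sync-file README` in the test results.
before = parse_time("1m19.942s")  # php_fpm_always_restart: false
after = parse_time("3m12.836s")   # php_fpm_always_restart: true

print(f"added: {after - before:.3f}s, ratio: {after / before:.2f}x")
# prints "added: 112.894s, ratio: 2.41x"
```

In other words, always restarting adds a little under two minutes to this sync in the measured run, roughly 2.4 times the previous duration.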
Out of scope for this task:
- Deal with long-running jobs and maintenance scripts.