
Make scap skip restarting php-fpm when using --force
Open, Medium, Public

Description

If there is a bad patch deployed that needs to be reverted immediately, the revert is more important than the checks for php-fpm's cache.

Using the --force flag should Just Work™ and deploy as fast as possible for all scap sync-* commands.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jan 16 2020, 7:27 PM
thcipriani triaged this task as High priority. · Jan 16 2020, 7:27 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.
LarsWirzenius added a subscriber: LarsWirzenius.
Krinkle added a subscriber: Krinkle. · Edited · Mar 3 2020, 12:18 AM

Does the "maybe restart php-fpm" step take a noticeable amount of time? I thought it was more or less instantaneous.

In any event, I'm not supportive of it being skipped because doing so makes our very fragile PHP7 deployment even harder to reason about or trust.

Our PHP7 deployment is hanging by a very thin thread because opcache can get corrupted in a way that it does not detect. See T224491 for examples. We have been very fortunate so far that the random "off by one" corruptions it has performed on primitive values such as strings and bools haven't yet (as far as we know) caused loss of data or exposure of private data, but I have absolutely no reason to assume that they won't. If we had come to this realisation earlier, I would even have recommended we stay on HHVM until this is fixed in better ways than we have managed so far.

Note that despite our best efforts (e.g. predictive php-fpm restarts before we think opcache is about to shit itself...), we are still seeing occasional corruption (T245183). This wouldn't be too bad if it simply caused a segfault or another sane and fast failure, but as shown above that is not the case.

I understand that if a bad deploy happened we want to recover quickly, but can we quantify the number of seconds gained by skipping this (automated) step?

> Does the "maybe restart php-fpm" step take a noticeable amount of time? I thought it was more or less instantaneous.

The actual restart may be quick, but depooling the server prior to restart via conftool is slow.

How this works, currently:

  1. scap calls /usr/local/sbin/check-and-restart-php
  2. This is a script that calls the php7adm's opcache-info endpoint to dump the current size of the cache and determine if it's over the threshold in scap configuration
  3. Based on scap configuration, this script may decide to restart php; it does so via /usr/local/sbin/restart-php7.2-fpm, which calls out to conftool's safe-service-restart.py script
  4. This script updates pybal and queries conftool in a loop until the update propagates to ensure that a service is depooled. Due to some limitation in pybal this is the slow step.
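
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual check-and-restart-php or safe-service-restart.py code: the function names, parameters, and the `force` short-circuit (the behaviour this task proposes) are all hypothetical, and the pooled-state check is injected as a callable rather than talking to conftool or pybal.

```python
import time
from typing import Callable


def should_restart_php(cache_size_mb: float, threshold_mb: float,
                       force: bool = False) -> bool:
    """Decide whether php-fpm should be restarted (hypothetical helper).

    Assumes the opcache size has already been fetched and that the
    threshold comes from scap configuration. With force=True the
    restart is skipped entirely, as proposed in this task.
    """
    if force:
        return False
    return cache_size_mb > threshold_mb


def wait_until_depooled(is_pooled: Callable[[], bool],
                        timeout_s: float = 30.0,
                        interval_s: float = 0.5,
                        sleep: Callable[[float], None] = time.sleep,
                        clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until the server reports as depooled or the timeout expires.

    is_pooled is a stand-in for whatever query confirms that the
    depool has propagated. Returns True once depooled, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if not is_pooled():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The polling loop is where the time goes in this model: the restart decision is a single comparison, while the wait is bounded only by how long propagation takes.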

Additionally, we can't depool all of one type of server at once, so there are batches of server types that get depooled together. When we looked at this as part of the php7 upgrade (see T224857) I estimated that a sync-file would take ~11 minutes (instead of ~1 minute) if we restarted php-fpm on all appservers.
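
To show why batching multiplies the cost, here is a rough back-of-the-envelope model. It is only a sketch under stated assumptions: the function names, the idea of a fixed maximum depooled fraction, and a constant per-batch duration are all hypothetical, and none of the inputs are taken from T224857.

```python
import math


def batches_needed(server_count: int, max_depooled_fraction: float) -> int:
    """Number of sequential batches if only a fraction may be depooled at once.

    Hypothetical model: at least one server is always allowed per batch.
    """
    batch_size = max(1, math.floor(server_count * max_depooled_fraction))
    return math.ceil(server_count / batch_size)


def estimate_restart_seconds(server_count: int,
                             max_depooled_fraction: float,
                             seconds_per_batch: float) -> float:
    """Total time if each batch (depool, restart, repool) takes a fixed time."""
    return batches_needed(server_count, max_depooled_fraction) * seconds_per_batch
```

Under this model the total grows linearly with the number of batches, so a slow depool step is paid once per batch rather than once per deployment.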

I have been persuaded that we need to restart php-fpm for every deployment (cf: T236104#5920394). However slow deployments become, we need the ability to roll back fast.

The flip side of cache corruption that may cause availability problems, data corruption, or security vulnerabilities is the need to roll back a change that is causing availability problems, data corruption, or security vulnerabilities -- that's the problem we're trying to solve with this change.

LarsWirzenius lowered the priority of this task from High to Medium. · Mar 30 2020, 4:25 PM