Page MenuHomePhabricator

Update Scap to perform rolling restart for all MW deploy
Open, MediumPublic

Description

The outcome of T253673, is to go with idea 3 - rolling restarts.

Work:

Event Timeline

Krinkle created this task.Oct 20 2020, 5:40 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 20 2020, 5:40 PM
  1. Do a restart for all deploys. Take the hit on deploy time and/or focus on ways to reduce it.

I would like to do this -- it seems like it would solve a long-tail of issues that we've whack-a-mole'd -- but I have reservations about it.

The current estimate for the Scap rolling restart is 15 minutes. Is there low hanging fruit for reducing this?

My major worry isn't that deploys themselves take 15 minutes -- it's that rollbacks would also take 15 minutes. Waiting 15 minutes for a bad deploy to propagate and then waiting 15 minutes for it to be fixed is problematic (cf: T244544)

@Joe said today that a rolling restart takes 5-10 minutes, not 15 minutes. […]

RE: Having an option to take a shortcut, […] not restarting equates to effectively not having deployed any code.

However if we want to have a shortcut that doesn't do things in a rolling way, but immediately […] e.g. if most requests are http 5xx anyway, and if not having appserver capacity is handled gracefully higher in the traffic stack, then maybe an option for a bigger more risky batch could make sense. […]

I checked with SRE, and it looks like should not re-purpose restart-php7.2-fpm and safe-service-restart.py here. […]

Open questions (I guess for @Joe):

  • Would "simply" restarting php-fpm directly abort on-going requests? I assume so, and we're fine with this. Just checking so that we can document this as expected.

Yes, restarting php-fpm will kill all in-flight requests

  • What would happen to requests routed to an appserver during the restart? Fail fast? If not, is there something at hand that you'd like to configure differently to make such reqs fail fast?

Yes, it would fail fast, and unless it's a POST, it will be retried by ATS.

  • Is it worth putting some kind of very crude batching or rolling into this forced/emergency restart mode? Or is fanning out everywhere as fast as we can fine?

I think the same batching we maintain for the non-force case should be fine.

The only difference is in that case we'd just have to run a different command, but that might need some permissions work.

Krinkle updated the task description. (Show Details)Oct 20 2020, 5:44 PM
thcipriani updated the task description. (Show Details)Oct 20 2020, 5:46 PM
thcipriani assigned this task to dancy.Oct 20 2020, 6:07 PM
thcipriani added subscribers: dancy, LarsWirzenius.

Assigning per brief discussion in IRC. @dancy feel free to loop in others (@LarsWirzenius will be helpful for actual release) as needed.

thcipriani triaged this task as Medium priority.Oct 20 2020, 6:07 PM
thcipriani moved this task from Needs triage to Debt on the Scap board.
Joe added a comment.Oct 23 2020, 10:46 AM

I added what I think is an outline of the work we still need to do in order to make this process safe and efficient in T243009#6574045

Change 635991 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] safe-service-restart: add optional poolcounter support

https://gerrit.wikimedia.org/r/635991

Change 635992 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] poolcounter: add client configuration classes

https://gerrit.wikimedia.org/r/635992

Change 635993 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::lvs::realserver: add ability to configure poolcounter for pools

https://gerrit.wikimedia.org/r/635993

Change 635994 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] restbase: add poolcounter support to safe-service-restart scripts

https://gerrit.wikimedia.org/r/635994

Change 636074 had a related patch set uploaded (by Ahmon Dancy; owner: Ahmon Dancy):
[operations/puppet@production] modules/scap/templates/scap.cfg.erb: Update php_fpm_restart_script

https://gerrit.wikimedia.org/r/636074

Change 635991 merged by Giuseppe Lavagetto:
[operations/puppet@production] safe-service-restart: add optional poolcounter support

https://gerrit.wikimedia.org/r/635991

Change 635992 merged by Giuseppe Lavagetto:
[operations/puppet@production] poolcounter: add client configuration classes

https://gerrit.wikimedia.org/r/635992

Change 635993 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::lvs::realserver: add ability to configure poolcounter for pools

https://gerrit.wikimedia.org/r/635993

Change 635994 merged by Giuseppe Lavagetto:
[operations/puppet@production] restbase: add poolcounter support to safe-service-restart scripts

https://gerrit.wikimedia.org/r/635994

Change 640928 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::lvs::realserver: use poolcounter for guarding service restarts

https://gerrit.wikimedia.org/r/640928

Change 640928 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::lvs::realserver: use poolcounter for guarding service restarts

https://gerrit.wikimedia.org/r/640928