Scap should restart HHVM
Closed, Duplicate · Public

Description

HHVM was not designed to continue running indefinitely while its code is periodically updated by successive deploys, but that is how we run it currently. The consequence of this is that HHVM's translation cache is eventually exhausted and HHVM crashes.

My long-term view on how we ought to solve this involves switching to RepoAuth mode, with all the changes to our deployment tooling and process that this entails.

Until that time, we should make scap restart HHVM on each application server.

It should work like this:

  1. Sync all files to all app servers, as we do now.

This means that HHVM will start translating and executing the new code before it is restarted. That's not great, but doing it differently would make this a much bigger task.

  2. Send SIGWINCH to Apache on each app server to trigger a graceful stop. Wait for Apache to shut down.
  3. Restart HHVM.
  4. Start Apache.

Steps 2-4 would have to be staggered such that they only apply to a portion of the application server pool at a time. Simply setting a strict concurrency limit for the restart procedure in scap should do the trick.
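A strict concurrency limit of this kind could be sketched as follows. This is only an illustration, not scap's actual implementation: `rolling_restart`, `restart_host`, and `max_concurrent` are hypothetical names, and the per-host work (steps 2-4) is passed in as a callable.

```python
from concurrent.futures import ThreadPoolExecutor


def rolling_restart(hosts, restart_host, max_concurrent=2):
    """Call restart_host(host) for every host, at most max_concurrent at once.

    restart_host is a caller-supplied (hypothetical) callable that performs
    steps 2-4 on one host. Returns a dict mapping each host to True on
    success, or to the exception that restart_host raised.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    def attempt(host):
        # Capture failures so one bad host does not abort the whole run.
        try:
            restart_host(host)
            return host, True
        except Exception as exc:
            return host, exc

    results = {}
    # The pool size is the concurrency limit: no more than max_concurrent
    # hosts are being restarted at any moment.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        for host, outcome in pool.map(attempt, hosts):
            results[host] = outcome
    return results
```

With `max_concurrent=2` and a pool of ten servers, at most two are out of service at any moment; the returned dict shows which hosts failed and need follow-up.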

Event Timeline

ori created this task. Jun 18 2015, 8:40 PM
ori assigned this task to bd808.
ori raised the priority of this task to Normal.
ori updated the task description.
ori added projects: HHVM, Deployments.
ori added a subscriber: ori.
Restricted Application added a subscriber: Aklapper. Jun 18 2015, 8:40 PM
ori renamed this task from "scap should restart HHVM" to "Scap should restart HHVM". Jun 18 2015, 8:58 PM
ori set Security to None.
ori added subscribers: chasemp, thcipriani.
Joe added a subscriber: Joe. Jun 19 2015, 4:18 PM

In a couple of weeks, we will be able to pool/depool hosts from the load balancers with a call.

I think we should use that capability, so that steps 2-4 become:

  2. Depool the server from pybal and wait N seconds.
  3. Restart HHVM.
  4. Repool the server in pybal.

We can do this pretty easily if we batch the pools/depools, but of course it will need some testing/further work.

(Also, let me add that the pool/depool system comes with a Python library, so it can easily be integrated into scap.)
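The depool, restart, repool sequence proposed above might look roughly like this. The thread does not describe the pool/depool library's API, so the sketch takes `depool`, `restart_hhvm`, and `repool` as hypothetical caller-supplied callables; `drain_seconds` stands in for the "N seconds" wait.

```python
import time


def restart_with_depool(host, depool, restart_hhvm, repool,
                        drain_seconds=5.0, sleep=time.sleep):
    """Depool a host, wait for traffic to drain, restart HHVM, then repool.

    depool, restart_hhvm and repool are hypothetical callables taking the
    host name; the real pool/depool library API is not shown in this task.
    """
    depool(host)
    # Give in-flight requests time to finish (the "wait N seconds" step).
    sleep(drain_seconds)
    # If the restart raises, the exception propagates and the host is
    # deliberately left depooled rather than returned to service broken.
    restart_hhvm(host)
    repool(host)
```

Leaving a host depooled when the restart fails is a design assumption of this sketch; it trades reduced capacity for not serving traffic from a server in an unknown state.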

Change 219751 had a related patch set uploaded (by BryanDavis):
Add HHVM restart support

https://gerrit.wikimedia.org/r/219751

thcipriani moved this task from To Triage to In-progress on the Deployments board. Jun 22 2015, 6:00 PM

Change 219751 merged by jenkins-bot:
Add HHVM restart support

https://gerrit.wikimedia.org/r/219751

greg awarded a token. Jun 22 2015, 8:48 PM
hashar moved this task from Backlog to New features on the HHVM board. Jun 23 2015, 10:01 PM

scap now has a --restart command line option that will run scap-hhvm-restart across the cluster. The scap-hhvm-restart script that runs on each host will:

  • Run apache2ctl shutdown-graceful
  • Run service hhvm restart
  • Run service apache2 start
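As a rough illustration of the per-host sequence listed above (not the actual scap-hhvm-restart script), the three commands can be run in order, stopping at the first failure. The command strings are copied from this comment and have not been verified against the real script; `run_restart_steps` and its `run` parameter are hypothetical names.

```python
import subprocess

# Commands as listed in the comment above; not checked against the real script.
RESTART_STEPS = [
    ["apache2ctl", "shutdown-graceful"],
    ["service", "hhvm", "restart"],
    ["service", "apache2", "start"],
]


def run_restart_steps(run=subprocess.check_call, steps=RESTART_STEPS):
    """Run each step in order and return the list of steps that completed.

    run defaults to subprocess.check_call, which raises CalledProcessError
    on a non-zero exit status, so a failed step aborts the sequence and
    later steps are not attempted.
    """
    completed = []
    for argv in steps:
        run(argv)
        completed.append(argv)
    return completed
```

Passing a different `run` callable makes it possible to dry-run the sequence (for example, logging the commands instead of executing them).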

When we get the etcd communication bits added and exposed from pybal we can replace the apache2 manipulation with direct etcd communication to de-pool the server that is being restarted.

This all needs more operational testing before we switch --restart to be the default behavior. It would be trivial to add a new command, run from the deploy server, that only does the restarts independently of a full scap, if that is somehow useful.

@ori did some additional operational testing, up to and including using --restart across the entire WMF prod cluster. The restart functionality worked, but both Icinga and external monitoring services paged during the restart.

Restricted Application added a subscriber: Matanya. Jun 25 2015, 6:34 PM
bd808 moved this task from Needs Review/Feedback to Done on the User-bd808 board. Jul 11 2015, 9:40 PM
bd808 moved this task from Done to Archive on the User-bd808 board. Jul 24 2015, 5:18 PM