HHVM was not designed to keep running indefinitely while its code is periodically updated by successive deploys, but that is how we currently run it. As a result, HHVM's translation cache is eventually exhausted and HHVM crashes.
My long-term view on how we ought to solve this involves switching to RepoAuth mode, with all the changes to our deployment tooling and process that this entails.
Until that time, we should make scap restart HHVM on each application server.
It should work like this:
- Sync all files to all app servers, as we do now.
  This means that HHVM will start translating and executing the new code before it is restarted. That's not great, but doing it differently would make this a much bigger task.
- Send SIGWINCH to Apache on each app server to trigger a graceful stop. Wait for Apache to shut down.
- Restart HHVM.
- Start Apache.
Steps 2-4 (stopping Apache, restarting HHVM, and starting Apache) would have to be staggered so that they apply to only a portion of the application server pool at a time. Setting a strict concurrency limit for the restart procedure in scap should do the trick.
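To make the staggering concrete, here is a minimal sketch of the restart procedure with a concurrency limit. It is not scap code. The names `RESTART_STEPS`, `restart_host`, `rolling_restart`, and the `run(host, command)` callable are hypothetical, and the command strings are assumptions about how each step might be invoked on an app server. In particular, `wait-for-apache-exit` is a placeholder, not a real command.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-host command sequence. The exact commands depend on the
# init system in use and are assumptions, not what scap actually runs.
RESTART_STEPS = (
    "apache2ctl graceful-stop",  # sends SIGWINCH to the Apache parent process
    "wait-for-apache-exit",      # placeholder for "wait for Apache to shut down"
    "service hhvm restart",
    "service apache2 start",
)


def restart_host(host, run):
    """Run the restart steps on one host in order, stopping at the first failure.

    `run(host, command)` is a caller-supplied callable (for example an SSH
    wrapper) that returns True on success and False on failure.
    """
    for step in RESTART_STEPS:
        if not run(host, step):
            return False
    return True


def rolling_restart(hosts, run, max_concurrent):
    """Restart every host, with at most `max_concurrent` hosts in progress at once.

    Returns a dict mapping each host to True (all steps succeeded) or False.
    """
    hosts = list(hosts)
    # The pool size is the concurrency limit: a worker only picks up the next
    # host after it has finished all steps on the previous one.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = pool.map(lambda host: restart_host(host, run), hosts)
        return dict(zip(hosts, results))
```

Calling `rolling_restart(app_servers, run, max_concurrent=5)` would take at most five servers out of rotation at any moment. In this sketch a host whose HHVM restart fails is left with Apache stopped and is reported as `False`, which keeps a broken server from serving traffic. Whether that is the right failure policy is a separate decision.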