(My understanding of this problem is still inexact, so my language is imprecise. Feel free to edit this task as we gain some clarity.)
Problem statement
- The broader the scope, the harder it is for a JIT to reason about invariants and to optimize code.
- HHVM does not handle code in file scope well: it does not optimize it thoroughly, it places the translations in the cold translation cache, and it does not always pick up changes to file-scope code.
- We have a lot of code in file scope. InitialiseSettings.php alone is five times longer than the Bhagavad-Gita.
- HHVM's translation cache does not have an eviction mechanism, or its eviction mechanism does not work for the cold cache, or it has some unspecified bug.
Consequences of the above for us are:
- We don't benefit from HHVM as much as we could, because so much of our code is in file scope. (This is a relatively minor issue.)
- Certain code changes require that HHVM be restarted before the change is picked up. Changes to StartProfile.php seem particularly susceptible to this.
- Changes to code in file scope are more likely than other changes to cause a sharp increase in the size of the translation cache. When the translation cache is full, HHVM aborts with SIGABRT and restarts, dropping any in-flight requests and causing an outage until enough servers have recovered.
Plan of action
Short-term
- Make it possible to restart HHVM on all app servers without running scap, as @bd808 proposed in T103008: Scap should restart HHVM.
- Perform such a restart at least once on every day on which deployments occur.
- Avoid deploying changes to StartProfile.php and wikitech.php in quiet hours.
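One way to picture a restart that does not take the whole pool down at once is to restart a bounded fraction of the app servers at a time and stop as soon as a batch fails to come back. The sketch below is only an illustration under assumptions: the function names, the `max_fraction` parameter, and the `restart` / `is_healthy` callables are hypothetical placeholders, not part of scap or HHVM, and the actual restart command and health check are left to whatever the real tooling provides.

```python
from typing import Callable, Iterator, List, Sequence


def batches(hosts: Sequence[str], max_fraction: float) -> Iterator[List[str]]:
    """Yield consecutive groups of hosts, each at most max_fraction of the pool.

    At least one host is always included per batch, so very small pools
    still make progress.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    size = max(1, int(len(hosts) * max_fraction))
    for start in range(0, len(hosts), size):
        yield list(hosts[start:start + size])


def rolling_restart(
    hosts: Sequence[str],
    restart: Callable[[str], object],    # hypothetical: restarts HHVM on one host
    is_healthy: Callable[[str], bool],   # hypothetical: True once the host serves again
    max_fraction: float = 0.1,
) -> List[str]:
    """Restart hosts batch by batch; stop at the first batch that does not recover."""
    done: List[str] = []
    for batch in batches(hosts, max_fraction):
        for host in batch:
            restart(host)
        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            # Leave the remaining hosts untouched so the pool keeps serving.
            raise RuntimeError("hosts did not recover: " + ", ".join(failed))
        done.extend(batch)
    return done
```

Passing the restart and health-check steps in as callables keeps the batching logic testable without touching real servers; a real procedure would presumably also depool each host before restarting it and repool it afterwards.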
Mid-term
- Ensure that these issues are reported upstream and poke the HHVM team periodically for status updates.
- Iterate on the graceful restart procedure until it no longer generates alerts or spikes of 5xxs.
Long-term
- Stop using file scope for configuration. It's not 1998 (T28992: Implement configuration database aka configuration management aka no shell excuse).
- Switch to RepoAuthoritative mode (RepoAuth).