Background
Following T221347 and T224491, the MW app servers are restarted more often (to avoid a dangerous and silent corruption bug that can affect live traffic). More details at T224857, but in short - we are trying to keep uptime as high as possible, whilst still trying to cut traffic before php7-opcache initiates its corruptive subroutines.
Keeping uptime high is preferred for operational comfort: restarting too often means servers are unavailable more often, which reduces our capacity, requires draining servers that still have slow requests in flight, and can cause other cascading issues.
But, there is an additional reason: Performance. MediaWiki can't live without caching. It depends on it.
Rationale
The goal is to make sure we are comfortable restarting servers more often, and to prepare for a long-term future in which MediaWiki app servers are containers spawned as needed by automatically scaled infrastructure. In that future, the average server might have a relatively short lifespan.
For that reality to have acceptable performance, we need to warm each server up before it starts responding to user requests.
Status quo
Over the past few years with HHVM, I've generally seen that on web servers (not considering job runners or api servers):
- HHVM will generally stay up for a week or more,
- APC has virtually no space limitation (it grows until it OOMs.. not great)
- APC will only remove values when they are expired, never due to space pressure.
When we first realised point 2 (sometime in 2015?), the Performance Team audited production MW code to look for any keys that didn't have a TTL and made sure they all have a reasonable TTL (most things stay for minutes or hours, some things several days, and a small number of important things are allowed to last 2 weeks or a month).
Continuing from T224857#5258043 - We never planned to "depend" on server uptime or anything like that. And things might be quite alright if we restart every hour. I don't know...
What I do know is that, for years, we've monitored production and addressed latency problems as we saw them, with the tools available to us (APCu, Memc, Redis), and always verified those fixes in production.
Reducing uptime will likely uncover cases where a feature stores something important and slow to compute only in APC, which we never knew about because we never saw it cause a problem.
It'll also likely expose cases where we quickly fixed something with APCu, and didn't invest in extra complexity to also persist and coordinate the lifetime of the value somewhere central for backfilling APCu after a restart.
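To illustrate the gap described above, here is a minimal Python sketch (not MediaWiki code) of a per-server cache that is backfilled from a central store after a restart. The class and method names are hypothetical; the `local` dict stands in for APCu and the `shared` dict stands in for something central such as Memcached.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

Entry = Tuple[float, Any]  # (absolute expiry time, value)


class TwoTierCache:
    """Hypothetical two-tier cache: a local tier that is lost on restart,
    backed by a shared tier that survives it."""

    def __init__(self, shared: Dict[str, Entry]) -> None:
        self.local: Dict[str, Entry] = {}  # stand-in for APCu
        self.shared = shared               # stand-in for a central store

    def get_with_set(self, key: str, ttl: float,
                     compute: Callable[[], Any],
                     now: Optional[float] = None) -> Any:
        now = time.time() if now is None else now
        # Check the fast local tier first, then the shared tier.
        for tier in (self.local, self.shared):
            entry = tier.get(key)
            if entry is not None and entry[0] > now:
                # Backfill the local tier, keeping the original expiry so
                # both tiers agree on the lifetime of the value.
                self.local[key] = entry
                return entry[1]
        # Miss in both tiers: pay the full cost once, then store in both.
        entry = (now + ttl, compute())
        self.shared[key] = entry
        self.local[key] = entry
        return entry[1]

    def restart(self) -> None:
        """Simulate a server restart: only the local tier is wiped."""
        self.local.clear()
```

With only the local tier, every restart forces the expensive `compute` callback to run again on live traffic. With the shared tier, a restarted server refills its local tier from the central copy instead.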
Prior art
When we worked on switchdc's "urls-server" procedure 2 years ago, we did most of the thinking already. The per-server warmup is primarily aimed at warming up PHP's APCu:
- T154658: Prepare and improve the datacenter switchover procedure
- T156922: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc)
- T160178: MediaWiki Datacenter Switchover automation
Plan
- Identify cached data in APCu that is expensive to compute and not stored in Memcached. Specifically, data that is used by high-traffic entry points (e.g. page views, recent changes, load.php startup).
- Craft a minimal set of urls that'll warm it all up.
- Decide which wikis to run it for (e.g. wmf-config's large.dblist? more? fewer?)
- Set a target for how long the warmup script may take (1 minute? more? less?)
- Develop the script, and iterate by testing out on a freshly restarted/depooled server, until we meet the target.
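As a starting point for discussion, the plan above could be sketched roughly as follows in Python. This is not the existing switchdc warmup tool. The URL paths, the example hostnames, the default 60-second budget, and all function names are assumptions for illustration; the real URL set and wiki list are exactly what the steps above still need to decide.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

# Assumed entry points; the real minimal set is still to be crafted.
WARMUP_PATHS = [
    "/wiki/Main_Page",
    "/wiki/Special:RecentChanges",
    "/w/load.php?modules=startup&only=scripts",
]


def build_targets(wikis: Iterable[str],
                  paths: Iterable[str] = WARMUP_PATHS) -> List[Tuple[str, str]]:
    """Pair every wiki hostname with every warmup path."""
    paths = list(paths)
    return [(wiki, path) for wiki in wikis for path in paths]


def make_http_fetch(server: str, timeout: float = 10.0) -> Callable[[str, str], int]:
    """Return a fetcher that requests from one specific (depooled) server,
    selecting the wiki via the Host header."""
    def fetch(host: str, path: str) -> int:
        req = urllib.request.Request(f"http://{server}{path}",
                                     headers={"Host": host})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    return fetch


def warm_up(targets: Iterable[Tuple[str, str]],
            fetch: Callable[[str, str], int],
            budget_seconds: float = 60.0,
            workers: int = 4) -> Tuple[List[Tuple[str, str, str]], float]:
    """Request every target, skipping whatever has not started by the time
    the budget runs out. Returns per-target outcomes and elapsed seconds."""
    start = time.monotonic()

    def run(target: Tuple[str, str]) -> Tuple[str, str, str]:
        host, path = target
        if time.monotonic() - start > budget_seconds:
            return (host, path, "skipped")
        try:
            return (host, path, str(fetch(host, path)))
        except Exception as exc:  # a failed warmup request should not abort the run
            return (host, path, f"error: {exc}")

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run, targets))
    return results, time.monotonic() - start
```

Against a freshly restarted, depooled server, this would be called as `warm_up(build_targets(wikis), make_http_fetch(server))`, where `wikis` and `server` are placeholders. Comparing the elapsed time against the target gives the measurement to iterate on.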