Create warmup procedure for MediaWiki app servers
Closed, ResolvedPublic

Description

Background

Following T221347 and T224491, the MW app servers are restarted more often (to avoid a dangerous and silent corruption bug that can affect live traffic). More details at T224857, but in short - we are trying to keep uptime as high as possible, whilst still trying to cut traffic before php7-opcache initiates its corruptive subroutines.

Keeping uptime high is preferred for operational comfort: restarting too often means servers are more often unavailable, which reduces our capacity, and they have to be drained of slow requests etc., with other cascading issues.

But, there is an additional reason: Performance. MediaWiki can't live without caching. It depends on it.

Rationale

To make sure we are comfortable restarting servers more often, and as long-term preparation for a future where MediaWiki app servers are containers spawned as needed by automatically scaled infrastructure. This means the average server might have a relatively short lifespan.

For that reality to have acceptable performance, we need to warm each server up before it starts responding to user requests.

Status quo

Over the past few years with HHVM, I've generally seen that on web servers (not considering job runners or api servers):

  1. HHVM will generally stay up for a week or more,
  2. APC has virtually no space limitation (it grows until it OOMs.. not great)
  3. APC will only remove values when they are expired, never due to space pressure.

When we first realised point 2 (sometime in 2015?), Performance Team audited production MW code to look for any keys that don't have a TTL and make sure they all have a reasonable TTL (most things stay for minutes or hours, some things several days, and a small number of important things are allowed to last 2 weeks or a month).

Continuing from T224857#5258043 - We never planned to "depend" on server uptime or anything like that. And things might be quite alright if we restart every hour. I don't know...

What I do know is that, for years, we've monitored production and addressed latency problems as we saw them, with the tools available to us (APCu, Memc, Redis), and always verified those fixes in production.

Reducing uptime will likely uncover cases where a feature was created that stores something slow and important to compute only in APC. We've never known about these cases, because we've never seen them cause a problem.

It'll also likely expose cases where we quickly fixed something with APCu, and didn't invest in extra complexity to also persist and coordinate the lifetime of the value somewhere central for backfilling APCu after a restart.
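The "persist somewhere central for backfilling" idea can be sketched as a two-tier lookup. This is only an illustration in Python, not MediaWiki's actual caching API: the `TieredCache` class, its `get_with_set` method, and the plain dicts standing in for APCu and Memcached are all hypothetical.

```python
import time


class TieredCache:
    """Hypothetical two-tier cache: a per-server local tier backed by a shared tier."""

    def __init__(self, central, clock=time.monotonic):
        self.local = {}         # stands in for APCu: wiped on every restart
        self.central = central  # stands in for Memcached: survives restarts
        self.clock = clock

    def get_with_set(self, key, ttl, compute):
        """Return the cached value, recomputing only if both tiers miss."""
        now = self.clock()
        entry = self.local.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]
        # Local miss (e.g. right after a restart): try the central tier first.
        entry = self.central.get(key)
        if entry is None or entry[1] <= now:
            # Both tiers missed or expired: pay the expensive computation once.
            entry = (compute(), now + ttl)
            self.central[key] = entry
        # Backfill the local tier, keeping the original expiry time.
        self.local[key] = entry
        return entry[0]
```

With this shape, a freshly restarted server starts with an empty `local` dict but finds the value in `central`, so the expensive `compute` callback is not called again. The cost is the extra complexity the paragraph above describes: two stores whose lifetimes have to be coordinated.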

Prior art

When we worked on switchdc's "urls-server" procedure 2 years ago, we did most of the thinking already. The per-server warmup is primarily aimed at warming up PHP's APCu:

Plan
  • Identify cached data in APCu that is expensive to compute and not stored in Memcached. Specifically those that are used by high-traffic entry points (e.g. page views, recent changes, load.php startup).
  • Craft a minimal set of urls that'll warm it all up.
  • Decide which wikis to run it for (eg. wmf-config's large.dblist? more? fewer?)
  • Set a target for how long the warmup script may take (1 minute? more? less?)
  • Develop the script, and iterate by testing out on a freshly restarted/depooled server, until we meet the target.
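The plan above could take roughly the following shape. This is a minimal sketch in Python under stated assumptions, not the existing switchdc procedure: the URL templates, the function names, and the 60-second default budget are all hypothetical placeholders for whatever the audit and target-setting steps produce.

```python
import time
import urllib.request

# Hypothetical URL templates; the real set would come from the audit step.
URL_TEMPLATES = [
    "/wiki/Main_Page",
    "/wiki/Special:RecentChanges",
    "/w/load.php?modules=startup&only=scripts",
]


def build_requests(wikis, templates=URL_TEMPLATES):
    """Return (host, path) pairs: every template for every wiki hostname."""
    return [(host, path) for host in wikis for path in templates]


def make_http_fetch(server, timeout=10.0):
    """Build a fetch function that hits one depooled server directly.

    The Host header selects the wiki; `server` is the app server's address.
    """
    def fetch(host, path):
        req = urllib.request.Request(
            f"http://{server}{path}", headers={"Host": host}
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            resp.read()
    return fetch


def warm_up(requests, fetch, budget_seconds=60.0, clock=time.monotonic):
    """Fetch each (host, path) in order until done or the budget is spent.

    Returns (warmed, not_warmed) so the caller can see what was left cold.
    """
    deadline = clock() + budget_seconds
    warmed, not_warmed = [], []
    for host, path in requests:
        if clock() >= deadline:
            not_warmed.append((host, path))
            continue
        try:
            fetch(host, path)
            warmed.append((host, path))
        except Exception:
            # One failed request should not abort the rest of the warmup.
            not_warmed.append((host, path))
    return warmed, not_warmed
```

A run against a depooled server might look like `warm_up(build_requests(wikis), make_http_fetch("localhost"))`, where `wikis` is the chosen list of hostnames. Returning the requests that were not warmed makes it easier to iterate on the URL set until the whole run fits within the target.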

Event Timeline

I am not convinced this is a great idea.

Warmup at restart makes all operating procedures more complex, and frankly the performance gain seems pretty minimal to me, and has yet to be demonstrated.

We've been regularly restarting php-fpm for months now, and I don't see any demonstrable performance degradation caused by it.

I suggest we decline this task until there is a proven need for something like this.

During the first three switchovers the impact and gains were imho quite clearly proven, given that on a cold server (at the time HHVM), latencies were in the dozens of seconds up to a minute even for load.php. After the warmup, responses were generally around 10-20ms with some 1-5s outliers due to unrelated reasons.

We're also currently in the midst of at least a dozen perf regressions, with small ones added almost every week, adding up. I have yet to learn that this isn't at least in part caused by the switch to PHP7 and/or the increased or continued use of app server restarts without MediaWiki's hot code paths being refactored to deal with such an operational model. From what I can tell, it has at least in part reversed 2-3 years of optimisation work. There is no blame here and I fully agree such restarts should be tolerated, and it would certainly be nice not to need any warmups. However I can say with certainty that warmups would significantly shorten the tail of load.php latencies, and spare real users from having to wait 60 seconds for the page to render, e.g. when they are the first to request a given stylesheet after a CDN cache miss and an app server restart.

Krinkle triaged this task as Low priority.
Krinkle added a subscriber: dpifke.

We are currently restarting and wiping php/fpm/opcache/apcu on a regular basis. This isn't great and I think there's room for improvement here.

Having said that, it would be better if we "simply" don't need warmups and rely less on local APC alone for really expensive stuff. Things happening toward that end:

With the above we'd reduce our exposure to timeouts to cases where the server is freshly restarted as a whole. E.g. after maintenance before repooling, or in a future that involves containers. For the general pooling process I think it's worth still looking into a brief warmup of sorts, also for e.g. Etcd (which surprisingly can fail in ways I wasn't aware of - T256900). But.. that's less urgent, so lowering priority on this for now in favour of the above two things.

Linking to T240775: RFC: Support PHP 7.4 preload, given that it could be part of some warmup procedure.

I'll also mention PHP-PM, which is an equivalent of PHP-FPM, essentially with a common loading of PHP classes and services followed by a loop handling PSR-7 HTTP requests/responses. I quickly tried to handle MW requests within PHP-PM, but it's not really possible and/or not useful in terms of performance gain. A lot of refactoring of entry points would be needed to make it fully compatible, so nothing is possible without long-term refactoring.

Even if a warmup procedure per se complicates operations, perhaps the initial MW loading could be better partitioned, to have more flexibility and to make the following available on an optional basis:

  1. 7.4-preload classes
  2. use PHP-PM
  3. use some warmup script

I know it doesn't address the issue of heterogeneous deployments, which could weigh on the decision.

Krinkle claimed this task.

We basically have this, and it is used for DC switchovers. If and when we need it elsewhere (e.g. for PHP preload T240775, or for Kubernetes pod warmup) we can work on generalising it then and there as needed.