Create warmup procedure for MediaWiki app servers
Open, Needs Triage · Public

Description

Background

Following T221347 and T224491, the MW app servers are restarted more often (to avoid a dangerous and silent corruption bug that can affect live traffic). More details at T224857, but in short - we are trying to keep uptime as high as possible, whilst still trying to cut traffic before php7-opcache initiates its corruptive subroutines.

Keeping uptime high is preferred for operational comfort: restarting too often means servers are more often unavailable, which reduces our capacity, requires draining slow requests, and causes other cascading issues.

But, there is an additional reason: Performance. MediaWiki can't live without caching. It depends on it.

Rationale

We want to be comfortable restarting servers more often, and to prepare for a long-term future where MediaWiki app servers are containerised and spawned as needed by automatically-scaled infrastructure. This means the average server might have a relatively short lifespan.

For that reality to have acceptable performance, we need to warm each server up before it starts responding to user requests.

Status quo

Over the past few years with HHVM, I've generally seen that on web servers (not considering job runners or api servers):

  1. HHVM will generally stay up for a week or more,
  2. APC has virtually no space limitation (it grows until it OOMs.. not great)
  3. APC will only remove values when they are expired, never due to space pressure.

When we first realised point 2 (sometime in 2015?), the Performance Team audited production MW code to look for any keys that didn't have a TTL and made sure they all have a reasonable TTL (most things stay for minutes or hours, some things several days, and a small number of important things are allowed to last 2 weeks or a month).

Continuing from T224857#5258043 - We never planned to "depend" on server uptime or anything like that. And things might be quite alright if we restart every hour. I don't know...

What I do know is that, for years, we've monitored production and addressed latency problems as we saw them, with the tools available to us (APCu, Memc, Redis), and always verified those fixes in production.

Reducing uptime will likely uncover cases where a feature stores something slow to compute and important only in APC, and where we've never known about it because we've never seen it cause a problem.

It'll also likely expose cases where we quickly fixed something with APCu, and didn't invest in extra complexity to also persist and coordinate the lifetime of the value somewhere central for backfilling APCu after a restart.
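The difference between a value that lives only in APCu and one that can be backfilled from a central store can be sketched as below. This is a minimal illustration in Python rather than MediaWiki code: `TtlCache` is a hypothetical in-memory stand-in for both APCu (per-server) and Memcached (central), and `get_with_backfill` is an assumed helper name, not an existing API.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TtlCache:
    """Hypothetical in-memory stand-in for a TTL key-value store."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._data: Dict[str, Tuple[Any, float]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired values are dropped on read.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl: float) -> None:
        self._data[key] = (value, self._clock() + ttl)

    def clear(self) -> None:
        # Models a server restart wiping the per-server cache.
        self._data.clear()


def get_with_backfill(
    local: TtlCache,
    central: TtlCache,
    key: str,
    compute: Callable[[], Any],
    ttl: float,
) -> Any:
    """Read from the local tier, then the central tier, then compute.

    Simplifications: a value of None is treated as a miss, and the local
    copy gets a fresh TTL instead of the remaining central TTL.
    """
    value = local.get(key)
    if value is not None:
        return value
    value = central.get(key)
    if value is None:
        value = compute()  # the slow path a cold server would otherwise hit
        central.set(key, value, ttl)
    local.set(key, value, ttl)  # backfill the per-server cache
    return value
```

With this shape, a restart (modelled here by `local.clear()`) costs one central lookup per key instead of a recomputation. A value written only to the local tier has no such fallback, which is the case the two paragraphs above are concerned about.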

Prior art

When we worked on switchdc's "urls-server" procedure 2 years ago, we did most of the thinking already. The per-server warmup is primarily aimed at warming up PHP's APCu:

Plan
  • Identify cached data in APCu that is expensive to compute and not stored in Memcached. Specifically those that are used by high-traffic entry points (e.g. page views, recent changes, load.php startup).
  • Craft a minimal set of urls that'll warm it all up.
  • Decide which wikis to run it for (e.g. wmf-config's large.dblist? more? fewer?)
  • Set a target for how long the warmup script may take (1 minute? more? less?)
  • Develop the script, and iterate by testing out on a freshly restarted/depooled server, until we meet the target.
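
As a rough illustration of the plan above, a warmup runner could look like the following Python sketch. Everything here is an assumption for illustration: the entry-point paths, the wiki hostnames, the 60-second default budget, and the `build_warmup_urls` and `run_warmup` names are hypothetical, and the actual set of URLs would come out of the audit in the first step. The HTTP call is passed in as `fetch` so the time-budget logic can be exercised without a network; a real script would presumably send these requests to the depooled server itself.

```python
import time
from typing import Callable, List, Sequence, Tuple

# Hypothetical entry points, loosely matching "page views, recent changes,
# load.php startup". The real list is an open question in the plan.
ENTRY_POINTS: Sequence[str] = (
    "/wiki/Main_Page",
    "/wiki/Special:RecentChanges",
    "/w/load.php?modules=startup&only=scripts",
)


def build_warmup_urls(
    wiki_hosts: Sequence[str],
    entry_points: Sequence[str] = ENTRY_POINTS,
) -> List[str]:
    """Return one URL per (wiki, entry point) pair, grouped by wiki."""
    return [f"https://{host}{path}" for host in wiki_hosts for path in entry_points]


def run_warmup(
    urls: Sequence[str],
    fetch: Callable[[str], object],
    budget_seconds: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], List[str]]:
    """Request each URL in order until the time budget is used up.

    Returns (requested, skipped). The budget is checked before each
    request, so a request already in flight is never interrupted.
    """
    start = clock()
    requested: List[str] = []
    for index, url in enumerate(urls):
        if clock() - start >= budget_seconds:
            return requested, list(urls[index:])
        fetch(url)
        requested.append(url)
    return requested, []
```

Returning the skipped URLs makes it easy to see, while iterating on a freshly restarted server, whether the chosen set of wikis and URLs fits within the target duration.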

Event Timeline

Krinkle created this task. · Aug 7 2019, 3:57 PM
Restricted Application added a subscriber: Aklapper. · Aug 7 2019, 3:57 PM
Krinkle updated the task description. · Aug 7 2019, 3:58 PM
jijiki added a subscriber: jijiki. · Aug 7 2019, 4:01 PM
Gilles assigned this task to dpifke. · Jan 7 2020, 11:37 AM
Joe added a comment. · Jan 17 2020, 7:28 AM

I am not convinced this is a great idea.

Warmup at restart makes all operating procedures more complex, and frankly the performance gain seems pretty minimal to me, and is still to be demonstrated.

We've been regularly restarting php-fpm for months now, and I don't see any demonstrable performance degradation caused by it.

I suggest we decline this task until there is a proven need for something like this.

Krinkle added a comment. · Edited · Feb 4 2020, 10:56 PM

During the first three switchovers, the impact and gains were imho quite clearly proven, given that on a cold server (at the time HHVM), latencies were in the dozens of seconds, up to a minute, even for load.php. After the warmup, responses were generally around 10-20ms with some 1-5s outliers due to unrelated reasons.

We're also currently in the midst of at least a dozen perf regressions, with small ones added almost every week, adding up. I have yet to learn that this isn't at least in part caused by the switch to PHP7 and/or the increased or continued use of app server restarts without MediaWiki's hot code paths being refactored to deal with such an operational model. From what I can tell, it has at least in part reversed 2-3 years of optimisation work. There is no blame here and I fully agree such restarts should be tolerated, and it would certainly be nice not to need any warmups. However, I can say with certainty that warmups would significantly shorten the tail of load.php latencies, and spare real users from having to wait 60 seconds for the page to render, e.g. when they are the first to request a given stylesheet after a CDN cache miss and an app server restart.