Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc)
Closed, Resolved · Public

Description

During a switchover, we want to clean up the caches in the inactive datacenter to avoid serving stale data. As a consequence, we need to warm up the caches in the inactive DC after the wipe, to avoid various inconveniences.

Last time we did a switchover it was done using apache-fast-test and a list of URLs that @Krinkle prepared. The issue is that apache-fast-test can run the requests against each server individually (which is needed for populating APC), but it issues them sequentially. If those URLs take significant time to render (and they do), the total time taken for the warmup can be quite long, on the order of 15 minutes. Since every HHVM server can sustain much more than one request in parallel, we'd need a warmup tool that does the following:

  • perform the minimum possible number of requests to warm up the caches satisfactorily
  • support possibly different lists of URLs for the API servers, normal appservers, imagescalers, etc.
  • be able to run the list of requests in parallel for each server
  • be able to run the list of requests with a configurable concurrency on each server

We can then experiment with this tool to find the concurrency that allows us to warm up the caches in the minimum amount of time.
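For illustration only, here is a minimal sketch of that core requirement: running a URL list against a single app server with a configurable concurrency. It is written in Python for brevity (the eventual tool is a Node.js script, linked later in this task), and every name in it is hypothetical:

```python
# Illustrative sketch only; not the actual warmup tool.
# Runs a URL list against one app server with bounded parallelism.
import concurrent.futures
import urllib.request


def fetch(url, host=None, timeout=60):
    """Fetch one URL; optionally force a Host header so a specific wiki
    can be requested from a specific app server (per-server APC warmup)."""
    req = urllib.request.Request(url)
    if host:
        req.add_header("Host", host)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()  # read and discard the body so the request fully completes
        return url, resp.status


def warmup_one_server(urls, concurrency=10):
    """Run the URL list against one server, at most `concurrency` requests at a time."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(fetch, urls))
```

Running this once per app server, in parallel, would cover the "in parallel for each server" and "configurable concurrency on each server" points above.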

Joe created this task. Feb 1 2017, 4:11 PM
Krinkle added a comment (Edited). Feb 1 2017, 6:31 PM

Initial sketch for the warmup of Memcached (cluster-wide) and APC (per-server).

  • 3-5 urls for each of the 750 public wikis (not private, fishbowl or closed) - page views and load.php (not sure there is anything for api or thumb): ~3,800 urls in total.
  • Orchestrate across all servers in the cluster.
  • Configure concurrency of urls to request simultaneously on one server.
  • Configure concurrency of servers to hit simultaneously. (all?)
  • Configure global max concurrency of requests in transit (to avoid overloading common infra beyond natural traffic such as External store)
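Building on a per-request fetch helper like the one sketched in the task description, the global cap on requests in transit could be a shared semaphore wrapped around every request. Again a hedged Python sketch with assumed values, not the real tool:

```python
# Sketch: cluster-wide cap on requests in flight, layered on top of the
# per-server concurrency. The cap value is an assumption; tune it against
# the request-rate estimates below.
import threading
import urllib.request

GLOBAL_MAX_IN_FLIGHT = 500
_global_slots = threading.BoundedSemaphore(GLOBAL_MAX_IN_FLIGHT)


def capped_fetch(url, timeout=60):
    """Fetch one URL without ever exceeding the cluster-wide in-flight cap."""
    with _global_slots:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            return resp.status
```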

Requirements:

  • Where to fetch a list of current app servers in the cluster? (conf.d?)
  • Where to fetch a list of canonical server names for all public wikis, including the mobile variant? (sitematrix? + maybe hardcode wgMobileUrlTemplate from wmf-config/InitialiseSettings.php; see the sketch after this list)
  • Typical request rate for an app server? (cursory research shows 15-50 req/s per P4863)
  • Typical cluster-wide request rate for mw backends? (2000-6000 req/s)
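On the second point, a minimal sketch of how the public-wiki list could be pulled from the sitematrix API, filtering out private, fishbowl and closed wikis. The response layout assumed here (numbered language groups plus a "specials" list, with private/fishbowl/closed flags on each site) should be double-checked against the live API, and the mobile-host derivation is not covered:

```python
# Sketch: build the list of public wiki URLs from action=sitematrix.
# The exact response layout is an assumption to verify against the live API.
import json
import urllib.request

SITEMATRIX = "https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json"


def public_wiki_urls():
    with urllib.request.urlopen(SITEMATRIX, timeout=30) as resp:
        matrix = json.loads(resp.read().decode("utf-8"))["sitematrix"]
    urls = []
    for key, group in matrix.items():
        if key == "count":
            continue
        sites = group if key == "specials" else group.get("site", [])
        for site in sites:
            # Skip wikis flagged as private, fishbowl or closed.
            if any(flag in site for flag in ("private", "fishbowl", "closed")):
                continue
            urls.append(site["url"])
    return urls
```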

URLs:

warm-urls-cluster.txt (5)
# Purpose: Root redirect
https://%server/
# Purpose: Main Page, Skin sidebar, Localisation cache
https://%server/wiki/Main_Page
# Purpose: MobileFrontend, Main Page
https://%mobileServer/wiki/Main_Page
# Purpose: Login page
https://%server/wiki/Special:UserLogin
# Purpose: API, Recent changes
https://%server/w/api.php?format=json&action=query&list=recentchanges
warm-urls-server.txt (3)
# Purpose: APC for ResourceLoader
https://%server/w/load.php?debug=false&modules=startup&only=scripts
https://%server/w/load.php?debug=false&modules=jquery%2Cmediawiki&only=scripts
https://%server/w/load.php?debug=false&modules=site%7Csite.styles
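The %server and %mobileServer placeholders above would be expanded once per public wiki. A trivial sketch of that expansion (the wiki entries shown are just examples):

```python
# Sketch: expand the %server / %mobileServer placeholders from the URL
# templates above, once per public wiki. Wiki entries are illustrative.
def expand_urls(templates, wikis):
    """templates: URL templates; wikis: list of (server, mobile_server) pairs."""
    urls = []
    for server, mobile_server in wikis:
        for tpl in templates:
            if "%mobileServer" in tpl and not mobile_server:
                continue  # skip mobile URLs for wikis without a mobile variant
            urls.append(tpl.replace("%server", server)
                           .replace("%mobileServer", mobile_server or ""))
    return urls


templates = ["https://%server/wiki/Main_Page",
             "https://%mobileServer/wiki/Main_Page"]
wikis = [("en.wikipedia.org", "en.m.wikipedia.org")]
print(expand_urls(templates, wikis))
```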

The total run should take less than 3 minutes. Estimates:

  • global (warm-urls-cluster.txt): 750 public wikis * 5 urls / 500 rate = 7.5s
    • A local test against text-lb shows latency will average between 500ms and 7s. Runtime varied from 30s to 45s.
  • per-server (warm-urls-server.txt): 750 public wikis * 3 urls / 50 rate = 45s (~1 min)
    • A local test against mwdebug1001 shows latency will average between 150ms and 14s. Runtime varied from 48s to 55s.
    • Using a concurrency of 45 per server means global concurrency may peak around 5500 req/s.

Source code: https://gist.github.com/Krinkle/dfdcecb094570f1df520c76fc4630e56

Volans added a subscriber: Volans. Feb 1 2017, 6:32 PM
Volans added a comment. Feb 1 2017, 6:43 PM

Regarding the requirements, it might be helpful to shuffle the list of wikis/URLs for each server, so the same URL is not requested from different servers at the same time. My 2 cents.
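A quick sketch of that idea, assuming each server gets its own independently shuffled copy of the list (Python, illustrative only):

```python
# Sketch of the shuffling suggestion: give every server its own randomly
# ordered copy of the URL list, so servers don't hit the same URL at once.
import random


def per_server_url_lists(urls, servers, seed=None):
    rng = random.Random(seed)
    return {server: rng.sample(urls, len(urls)) for server in servers}
```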

Krinkle claimed this task. Feb 1 2017, 7:59 PM
Gilles moved this task from Inbox to Next-up on the Performance-Team board. Feb 1 2017, 8:03 PM

thanks @Krinkle!
I have some questions, mostly due to my ignorance of what mw does with memcached. If we were to wipe the caches in codfw (say, today) without touching the mw config and run the warmup script against codfw, would that be a realistic test of what would happen during the switchover in terms of performance?
Also, if I understand correctly, the warmup is one of the things that bounds our read-only time during the switchover, in that case we could start warming up wikis sorted by e.g. their pageviews to further shorten the acceptable read-only time.

Warning: I am swapping all the mc2* codfw hosts with new hardware in T155755, I should complete the work in a couple of days.

Joe added a comment. Feb 2 2017, 11:23 AM

Correct me if I'm wrong, but I think the Main page call can be skipped for all non-standard-wiki-serving machines, so API, image/video scalers; also: do we really need to warm up APC for all of the wikis? Or could we target only the ones doing 99% of the traffic (which I guess are way less than that?).

warmup is one of the things that bounds our read-only time during the switchover, in that case we could start warming up wikis sorted by e.g. their pageviews to further shorten the acceptable read-only time.

That would significantly complicate the script as well as the actual switchover process. You'd have to deploy many changes to mw-config during the switchover to gradually put more and more wikis in read-only mode. The warmup script, meanwhile, takes less than a minute to run. I doubt we'd save meaningful time, considering the gradual read-only switching would have to be done manually and would only shave off part of those ~50 seconds.

Correct me if I'm wrong, but I think the Main page call can be skipped for all non-standard-wiki-serving machines, so API, image/video scalers; also: do we really need to warm up APC for all of the wikis? Or could we target only the ones doing 99% of the traffic (which I guess are way less than that?).

For video scalers, yes, maybe. But API and app servers both should have the main page query and the RC query. They are catch-all entry points that warm up a lot of shared resources that are not limited to page views or API queries.

Media scalers can be excluded from the entire process, as none of the urls apply to those right now. I'll make sure to exclude those from the confd query if feasible. Either way, though, the script is much better than last year's and should run in under a minute regardless.

elukey added a comment. Feb 8 2017, 6:11 PM

Warning: I am swapping all the mc2* codfw hosts with new hardware in T155755, I should complete the work in a couple of days.

Completed today; restarted all the nutcrackers in codfw to pick up the change. Please note: after https://gerrit.wikimedia.org/r/#/c/335780, Nutcracker is no longer restarted automatically on config changes (the problem arose when a change in the codfw pool triggered a restart of all the eqiad nutcrackers, due to how the config is laid out).

Krinkle moved this task from Next-up to Doing on the Performance-Team board. Feb 8 2017, 8:12 PM
Joe added a comment. Feb 9 2017, 7:23 AM

Another interesting possibility we might want to explore:

HHVM allows you to define a script (and headers!) you want to execute as a warmup procedure. It could be interesting to integrate the APC warmup into such a script so that any HHVM server will start up in the future with APC properly populated.

Probably not a matter for this quarter, though?

warmup is one of the things that bounds our read-only time during the switchover, in that case we could start warming up wikis sorted by e.g. their pageviews to further shorten the acceptable read-only time.

That would significantly complicate the script as well as the actual switchover process. You'd have to deploy many changes to mw-config during the switchover to gradually put more and more wikis in read-only mode. The warmup script, meanwhile, takes less than a minute to run. I doubt we'd save meaningful time, considering the gradual read-only switching would have to be done manually and would only shave off part of those ~50 seconds.

Indeed, it seems a whole lot of effort for small gains over 50s. Do you know if we could simulate a warmup (and a wipe beforehand) in codfw given how it is configured today in mediawiki?

Joe added a comment. Feb 15 2017, 11:50 AM

warmup is one of the things that bounds our read-only time during the switchover, in that case we could start warming up wikis sorted by e.g. their pageviews to further shorten the acceptable read-only time.

That would significantly complicate the script as well as the actual switchover process. You'd have to deploy many changes to mw-config during the switchover to gradually put more and more wikis in read-only mode. The warmup script, meanwhile, takes less than a minute to run. I doubt we'd save meaningful time, considering the gradual read-only switching would have to be done manually and would only shave off part of those ~50 seconds.

Indeed, it seems a whole lot of effort for small gains over 50s. Do you know if we could simulate a warmup (and a wipe beforehand) in codfw given how it is configured today in mediawiki?

I think we should indeed do a few tests in codfw before the switchover. I guess we'll have to coordinate with the DBAs to be sure this won't harm codfw.

Krinkle moved this task from Doing to Backlog on the Performance-Team board. Feb 15 2017, 8:29 PM
Krinkle added a comment (Edited). Feb 16 2017, 6:20 PM

Next steps:

  • Put node-warmup script in puppet.
  • Have it ensured on at least one host in eqiad and one in codfw.
    • Decide where to host this. I'm proposing terbium/wasat (maintenance script hosts).
    • We'll also need to ensure nodejs is installed on these hosts (no dependencies, just core nodejs v4 or later).
  • Decide where to host this. I'm proposing terbium/wasat (maintenance script hosts).

terbium is still on trusty, so the nodejs candidate is 0.10.25~dfsg2-2ubuntu1+wmf1, while wasat is on jessie, hence 6.9.1~dfsg-1+wmf1 will be installed. I'd rather choose something more homogeneous, unless terbium is due to be upgraded shortly. I'll ask for the status on T143536.

  • We'll also need to ensure nodejs is installed on these hosts (no dependencies, just core nodejs v4 or later).

On jessie hosts it would be 6.9.1~dfsg-1+wmf1 or later, so I'd say test it with this version to ensure there are no v4->v6 issues.

Change 339802 had a related patch set uploaded (by Krinkle):
[WIP] mediawiki: Add cache-warmup to maintenance

https://gerrit.wikimedia.org/r/339802

Joe added a comment. Feb 28 2017, 3:01 PM

I took what @Krinkle did in his patchset, fixed a couple of things in order to implement the "clone mode" and be able to simulate the full procedure:

  • I wiped all memcacheds in codfw
  • Restarted all HHVM appservers
  • Ran the script in spread mode: it took about 1.5 minutes to execute
  • Ran the script in clone mode only on the appservers in codfw; it took around 2 minutes to execute

This put some strain on the databases, but nothing really critical. For some reason that's not perfectly clear to me, the number of rows read from s3 is prominent, according to our database metrics.
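For reference, the two modes mentioned above differ roughly as follows: spread mode distributes the cluster-wide URL list across the pool once (enough for Memcached, which is shared), while clone mode sends the full per-server URL list to every app server (needed for APC, which is per-server). A rough Python sketch, not the actual implementation from the patch set:

```python
# Rough sketch of the two warmup modes; the real logic is in the Node.js
# cache-warmup script from change 339802.
import itertools


def spread(urls, servers):
    """Spread mode: each URL goes to exactly one server, round-robin.
    Sufficient for shared caches such as Memcached."""
    pool = itertools.cycle(servers)
    return [(next(pool), url) for url in urls]


def clone(urls, servers):
    """Clone mode: every server receives the full URL list.
    Needed for per-server caches such as APC."""
    return [(server, url) for server in servers for url in urls]
```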

Joe moved this task from Backlog to Doing on the User-Joe board. Feb 28 2017, 3:51 PM

Mentioned in SAL (#wikimedia-operations) [2017-02-28T21:05:50Z] <_joe_> manually installing nodejs on wasat T156922

Krinkle added a comment (Edited). Mar 1 2017, 12:16 AM
  • I wiped all memcacheds in codfw
  • Restarted all HHVM appservers
  • Ran the script in spread mode: it took about 1.5 minutes to execute
  • Ran the script in clone mode only on the appservers in codfw; it took around 2 minutes to execute

    This put some strain on the databases, but nothing really critical. For some reason that's not perfectly clear to me, the number of rows read from s3 is prominent, according to our database metrics.

I ran another test, without any Memcached wiping or HHVM restarting, though.

  • Spread mode for cluster urls (sent 3657 urls to codfw appservers; concurrency: 500): Took 26s and 12s (ran twice)
  • Clone mode for server urls (sent 2196 urls to 77 codfw appservers each; concurrency: 500/global, 100/server): Took 176s (2.9min) and 160s (2.6min)
  • Clone mode for server urls (sent 2196 urls to 77 codfw appservers each; concurrency: 2000/global, 100/server): Took 173s (2.8min)

On https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?var-dc=codfw%20prometheus%2Fops, I noticed nothing out of the ordinary, except that, as @Joe noticed, "Rows read (s3)" spiked quite prominently (from 800 rps to 1.1M rps). The other clusters remained under 100 rps. The spike caused by our few thousand urls makes sense considering we're comparing against an idle state with only health-status requests. For comparison, in eqiad most of the "Rows read" metrics are regularly between 3M and 8M.

As for s3 in particular, it's where most wikis are, so that'll reflect most of the traffic. During our warmup we're sending an atypical proportion of traffic to the servers and databases. Normally the bigger wikis get more traffic than the smaller wikis, which is why most wikis can be on one cluster (s3). Since we're sending the same 8 urls to all (public) wikis, s3 receives significantly more traffic than the others for the short duration of the test.

On https://grafana.wikimedia.org/dashboard/file/server-board.json, looking at various app servers: CPU spiked from 0% to 8% for most codfw app servers, compared to an average of 25% on most eqiad app servers. Load average jumped from 1 to 4 for a minute, compared to a load average of 12 on eqiad servers. No increase in RAM usage.

On https://grafana.wikimedia.org/dashboard/db/hhvm-apc-usage?var-cluster=2, APC value size (codfw/p95) went from ~2MB to ~1GB, which is expected given that previously it only received 1 url for 1 wiki (the enwiki Main Page health check), and now gets 8 urls for each of 750 wikis.

Still only a quarter of the typical APC value size of Eqiad (~4 GB), presumably since most cache keys are specific to users or pages, most of which don't get warmed up here, but that's fine.

Change 339802 merged by Giuseppe Lavagetto:
mediawiki: Add cache-warmup to maintenance

https://gerrit.wikimedia.org/r/339802

jcrespo added a subscriber: jcrespo. Mar 1 2017, 6:49 PM

I guess we'll have to coordinate with the DBAs to be sure this won't harm codfw.
@Joe noticed, "Rows read (s3)" spiked quite prominently (from 800 rps to 1.1M rps).

Please don't be worried about the main databases on codfw, like s3 - if they were fragile, they wouldn't withstand production queries :-). In fact, they throttle concurrency and put things in a queue if they cannot handle the throughput. I checked the stats and the largest s3 server peak in connections was 95-79 concurrent connections, and we are ready for 5000-10000.

In fact, my only worry (which is why I mentioned it before) was that the parsercache would affect eqiad because of replication (writes on codfw -> eqiad), but I haven't seen any issue with that.

As a reminder, the main DBs did not have issues last time even without a proper warmup; only the es* hosts did, because normally they have very, very little load thanks to caching, and they were briefly overloaded. The load there is much higher and that seems to create lag, which normally doesn't happen at all: https://grafana-admin.wikimedia.org/dashboard/db/mysql-aggregated?from=1488277850999&to=1488293698570&var-dc=codfw%20prometheus%2Fops&var-group=core&var-shard=All&var-role=All We mainly want the warmup to avoid the es* overload; if those are happy, that should be enough.

Krinkle moved this task from Backlog to Next-up on the Performance-Team board. Mar 1 2017, 8:52 PM

Change 340539 had a related patch set uploaded (by Krinkle):
[operations/puppet] mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5

https://gerrit.wikimedia.org/r/340539

Change 340539 merged by Giuseppe Lavagetto:
[operations/puppet] mediawiki-cache-warmup: Remove unused var, reduce concurrency, log slowest-5

https://gerrit.wikimedia.org/r/340539

Krinkle closed this task as "Resolved". Mar 6 2017, 7:42 PM
Krinkle moved this task from Next-up to Doing on the Performance-Team board.