Page MenuHomePhabricator

Needed two warmup cache runs during June 2021 DC switchover
Closed, ResolvedPublic

Description

During the June 2021 DC switchover, we had to run the warmup cache cookbook twice. After the first time load.php was responding in ~3s, which people felt was too slow. After the second run load.php perf was basically at what eqiad was responding with and we proceeded.

We may need to put in some more load.php URLs in the warmup script or maybe adjust the convergence timing logic in the cookbook.

Event Timeline

Where does the 3s measure come from? Is that from the switchover script self-reporting that after trying the per-server.txt list in a loop, it stopped after N iterations when the latency went no longer than 3s? Or is that from something else? Do we know how many iterations it did, and why it stopped? (I assume it meant it started plateauing or fluctuating upward again.)

Is it for any specific wiki that stands out, or were the startup urls for all wikis that slow?

The loop itself is new this year afaik. Perhaps a local minimum would make sense (e.g. at least three times or something like that).

On the other hand, all load.php-specific caches should be warm after the first run since they just go straight into php-apcu and memcached. I'm not sure what would be different between the 2nd and 10th request in that regard.

I assume the per-server stuff runs after the per-cluster stuff, right?

Where does the 3s measure come from? Is that from the switchover script self-reporting that after trying the per-server.txt list in a loop, it stopped getting after N iterations when the latency went no longer than 3s? Or is that from something else?

I think it was from mediawiki-cache-warmup's output.

Do we know how many iterations it did, and why it stopped? (I assume it meant it started plateauing or fluctuating upward again.)

3 times, see

12021-06-29 14:09:01,800 jayme 27247 [INFO] START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
22021-06-29 14:09:18,610 jayme 27247 [INFO] Running warmup script in codfw.
32021-06-29 14:09:18,610 jayme 27247 [INFO] The script will re-run until execution time converges.
42021-06-29 14:09:18,611 jayme 27247 [INFO] Running warmup script, take 1
52021-06-29 14:09:18,612 jayme 27247 [INFO] Executing commands [cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'), cumin.transpor
6ts.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warm
7up/urls-server.txt clone api_appserver codfw')] on '1' hosts: mwmaint2002.codfw.wmnet
82021-06-29 14:09:21,389 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'
92021-06-29 14:09:21,391 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/....svc.codfw.wmnet'.
102021-06-29 14:09:28,779 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'
112021-06-29 14:09:28,781 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/... appserver codfw'.
122021-06-29 14:09:36,095 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw'
132021-06-29 14:09:36,097 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/..._appserver codfw'.
142021-06-29 14:09:36,097 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
152021-06-29 14:09:36,098 jayme 27247 [INFO] Warmup completed in 0:00:17.487198
162021-06-29 14:09:36,099 jayme 27247 [INFO] Running warmup script, take 2
172021-06-29 14:09:36,101 jayme 27247 [INFO] Executing commands [cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'), cumin.transpor
18ts.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warm
19up/urls-server.txt clone api_appserver codfw')] on '1' hosts: mwmaint2002.codfw.wmnet
202021-06-29 14:09:37,979 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'
212021-06-29 14:09:37,981 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/....svc.codfw.wmnet'.
222021-06-29 14:09:44,926 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'
232021-06-29 14:09:44,927 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/... appserver codfw'.
242021-06-29 14:09:51,891 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw'
252021-06-29 14:09:51,892 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/..._appserver codfw'.
262021-06-29 14:09:51,893 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
272021-06-29 14:09:51,893 jayme 27247 [INFO] Warmup completed in 0:00:15.794661
282021-06-29 14:09:51,894 jayme 27247 [INFO] Running warmup script, take 3
292021-06-29 14:09:51,896 jayme 27247 [INFO] Executing commands [cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'), cumin.transpor
30ts.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warm
31up/urls-server.txt clone api_appserver codfw')] on '1' hosts: mwmaint2002.codfw.wmnet
322021-06-29 14:09:53,688 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'
332021-06-29 14:09:53,689 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/....svc.codfw.wmnet'.
342021-06-29 14:10:01,466 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'
352021-06-29 14:10:01,468 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/... appserver codfw'.
362021-06-29 14:10:09,677 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw'
372021-06-29 14:10:09,679 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/..._appserver codfw'.
382021-06-29 14:10:09,680 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
392021-06-29 14:10:09,680 jayme 27247 [INFO] Warmup completed in 0:00:17.786299
402021-06-29 14:10:09,680 jayme 27247 [INFO] Execution time converged, warmup complete.
412021-06-29 14:10:09,681 jayme 27247 [INFO] END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)
422021-06-29 14:11:26,375 jayme 27247 [INFO] START - Cookbook sre.switchdc.mediawiki.00-warmup-caches
432021-06-29 14:11:43,303 jayme 27247 [INFO] Running warmup script in codfw.
442021-06-29 14:11:43,303 jayme 27247 [INFO] The script will re-run until execution time converges.
452021-06-29 14:11:43,303 jayme 27247 [INFO] Running warmup script, take 1
462021-06-29 14:11:43,305 jayme 27247 [INFO] Executing commands [cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw')] on '1' hosts: mwmaint2002.codfw.wmnet
472021-06-29 14:11:45,684 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'
482021-06-29 14:11:45,686 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/....svc.codfw.wmnet'.
492021-06-29 14:11:53,968 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'
502021-06-29 14:11:53,969 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/... appserver codfw'.
512021-06-29 14:12:02,069 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw'
522021-06-29 14:12:02,070 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/..._appserver codfw'.
532021-06-29 14:12:02,071 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
542021-06-29 14:12:02,072 jayme 27247 [INFO] Warmup completed in 0:00:18.768034
552021-06-29 14:12:02,072 jayme 27247 [INFO] Running warmup script, take 2
562021-06-29 14:12:02,074 jayme 27247 [INFO] Executing commands [cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw')] on '1' hosts: mwmaint2002.codfw.wmnet
572021-06-29 14:12:03,996 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'
582021-06-29 14:12:03,997 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/....svc.codfw.wmnet'.
592021-06-29 14:12:11,894 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'
602021-06-29 14:12:11,896 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/... appserver codfw'.
612021-06-29 14:12:19,353 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw'
622021-06-29 14:12:19,355 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/..._appserver codfw'.
632021-06-29 14:12:19,355 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
642021-06-29 14:12:19,355 jayme 27247 [INFO] Warmup completed in 0:00:17.283323
652021-06-29 14:12:19,355 jayme 27247 [INFO] Running warmup script, take 3
662021-06-29 14:12:19,357 jayme 27247 [INFO] Executing commands [cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'), cumin.transports.Command('nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw')] on '1' hosts: mwmaint2002.codfw.wmnet
672021-06-29 14:12:21,463 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-cluster.txt spread appservers.svc.codfw.wmnet'
682021-06-29 14:12:21,464 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/....svc.codfw.wmnet'.
692021-06-29 14:12:28,138 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone appserver codfw'
702021-06-29 14:12:28,139 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/... appserver codfw'.
712021-06-29 14:12:36,229 jayme 27247 [INFO] Completed command 'nodejs /var/lib/mediawiki-cache-warmup/warmup.js /var/lib/mediawiki-cache-warmup/urls-server.txt clone api_appserver codfw'
722021-06-29 14:12:36,230 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'nodejs /var/lib/..._appserver codfw'.
732021-06-29 14:12:36,230 jayme 27247 [INFO] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
742021-06-29 14:12:36,230 jayme 27247 [INFO] Warmup completed in 0:00:16.874618
752021-06-29 14:12:36,230 jayme 27247 [INFO] Execution time converged, warmup complete.
762021-06-29 14:12:36,231 jayme 27247 [INFO] END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0)

Unfortunately I don't think the output itself was logged.

Is it for any specific wiki that stands out, or were the startup urls for all wikis that slow?

I don't remember, sorry.

I assume the per-server stuff runs after the per-cluster stuff, right?

Yes.

Sidenote, it's very easy to test the warmup process in eqiad/passive DC with the --live-test mode so we can iterate on it without needing to wait for the switch back.

Outcome of today's meeting:

  • We ran the warmup recipe against Eqiad and it did 3 takes, with the last two takes taking around 18 seconds, of which 5s for the per-appserver tasks, where the average average per-server latency was still 1.5s.
  • Running the recipe again did not make the latencies better, still around 18s for the task, and 1.5s on the average latency.

When using curl to make some of the requests manually it was consistently around 0.2s and 0.3s seconds, same when run against a live Codfw server. We could not reproduce any slow request in isolation. But, running the warmup script a few more times did consistenly take around 18s with 10 of the 80 requests taking 2 seconds or more. So it would seem the slow requests happen consistently happening under warmup load, but not outside.

The warmup load does not seem significantly high. We send 80 requests concurrently to a server, which is comparable to normal production off-peak load (e.g. mw254.codfw currently has 60-80 req/s with p95 around 800ms).

Some thoughts:

  • Maybe the cache is refusing to warm up for some reason. We noticed that the apcu memory use did not increase between the server being cold and during/after the warmup. In fact, apcu use went down several notches with each run (Grafana dashboard). This was very confusing and we double-checked that it is indeed computed correctly (usage not free), and that it was not reset in-between. We wanted to dump detailed APCu metadata but this the php7adm server function for this appears to be broken currently.
  • The fact that latency of one-off requests manually with cURL were fast while warmup remained slow suggested that perhaps the servers were already sufficiently warm. However looking at the overall percentile distribution during the warmup shows clearly that this is not the case (there are lots of slow requests, not just a few outliers). But, I would like to actually run the warmup script against a known-warm server to compare against. Kunal and I will depool a server next week in Codfw (now primary) and run it isolated against that without any restart between depool and "warmup" to confirm for sure how it performs on a server that has been up for a while with live traffic. I'm expecting the average latency to be far lower there (closer to 100ms, not 1.5 seconds).

Next steps:

  • T288866: Kunal to fix the apcu dump command in the php7adm server.
  • T288867: Timo to start rewriting mw-warmup in Python and add a "single host" command (e.g. instead of running against a whole cluster).
  • (This task) Kunal to add a minimum loop count (e.g. at least 6 takes before the cookbook evaluates the latency back-off logic, which is basically what we did last time with 2x3 takes. While we're not consistently seeing that this is useful, it seems 6 takes runs fast enough that it shouldn't matter one way or the other and can always be reduced again later if we need to).

Change 707457 had a related patch set uploaded (by Legoktm; author: Legoktm):

[operations/cookbooks@master] sre.switchdc.mediawiki: Run the warmup cache script at least 6 times

https://gerrit.wikimedia.org/r/707457

Change 707457 merged by jenkins-bot:

[operations/cookbooks@master] sre.switchdc.mediawiki: Run the warmup cache script at least 6 times

https://gerrit.wikimedia.org/r/707457