
Verify our current wikikube capacity (in both DCs) can handle all our traffic
Closed, ResolvedPublic

Description

What?
Resource-wise, wikikube capacity in both DCs is mostly equal, so I think we could consider running a preliminary test (i.e., a week before the switchover) where we depool all read mw-* traffic from codfw, one service at a time, and see what happens.

Why?
Since the previous switchover (March 2024, T357547), we have had one major change: we now serve almost all our traffic from MW-on-K8s. Part of the switchover involves having the source DC completely depooled for a week. This is a low-effort test that we can perform in a controlled manner.

Why codfw?
It currently handles less traffic than eqiad, so users will be less impacted.

Event Timeline

Spent a bit of time thinking about this today.

If we think in terms of a desired target php-fpm worker utilization and look at total (cross-site) active-state workers, we can back out a "high percentile" (say, over the course of a week) desired peak replica count.

So let's look at something like sum(phpfpm_processes_total{app="mediawiki", deployment=~"mw-(api-(ext|int)|web)", state="active"}) by (deployment) (i.e., only considering our three largest deployments with active-active services).
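
Concretely, pulling the weekly samples and extracting the percentiles looks something like the sketch below. This is illustrative only: the Prometheus/Thanos endpoint URL is a placeholder, not the real internal address, and the query is the one above.

```python
# Sketch: fetch a week of 1m-resolution samples and compute percentiles.
# PROM_URL is a placeholder (assumption), not the actual internal endpoint.
import time
import requests
import numpy as np

PROM_URL = "http://prometheus.example.internal/api/v1/query_range"  # hypothetical
QUERY = ('sum(phpfpm_processes_total{app="mediawiki", '
         'deployment=~"mw-(api-(ext|int)|web)", state="active"}) by (deployment)')

end = time.time()
resp = requests.get(PROM_URL, params={
    "query": QUERY,
    "start": end - 7 * 24 * 3600,  # past week
    "end": end,
    "step": "60s",                 # 1m granularity
})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    values = np.array([float(v) for _, v in series["values"]])
    print(series["metric"]["deployment"],
          "p95=%.0f" % np.percentile(values, 95),
          "p75=%.0f" % np.percentile(values, 75))
```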

If we take the result of the above query over the past week, sampled at 1m granularity, and assume 60% target utilization (and 8 workers per replica), we get something like:

deployment=mw-api-ext:  p95=313  p75=293  # current=206
deployment=mw-api-int:  p95=353  p75=324  # current=268
deployment=mw-web:      p95=498  p75=437  # current=302

as approximate target replica counts. So, that would require 388 additional replicas in total if we go for 60% at p95, or 278 at p75.
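
To spell out the arithmetic behind those targets, here's a rough reconstruction. Note that the worker-count inputs below are backed out of the table itself (replicas * target * workers), so this is illustrative rather than a recomputation from raw data.

```python
# Back out target replica counts from peak active-worker percentiles,
# assuming 8 php-fpm workers per replica as above.
WORKERS_PER_REPLICA = 8

active_workers = {  # approximate cross-site active-worker percentiles (illustrative)
    "mw-api-ext": {"p95": 313 * 0.6 * 8, "p75": 293 * 0.6 * 8},
    "mw-api-int": {"p95": 353 * 0.6 * 8, "p75": 324 * 0.6 * 8},
    "mw-web":     {"p95": 498 * 0.6 * 8, "p75": 437 * 0.6 * 8},
}
current = {"mw-api-ext": 206, "mw-api-int": 268, "mw-web": 302}

def target_replicas(workers: float, target_util: float) -> int:
    # Replicas needed so that `workers` active workers correspond to
    # `target_util` average utilization of the php-fpm pool.
    return round(workers / (WORKERS_PER_REPLICA * target_util))

for pct in ("p95", "p75"):
    extra = sum(
        max(0, target_replicas(active_workers[d][pct], 0.60) - current[d])
        for d in current
    )
    print(f"additional replicas at 60% / {pct}: {extra}")
# -> 388 at p95 and 278 at p75, matching the totals above
```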

Let's say optimistically that we're good citizens and use less than or equal to our aggregate per-pod resource requests (this seems to be the case in practice), so we can say "if it schedules, it'll be fine."

In that case, 388 additional replicas translate into ~ 2018 CPU-s/s and ~ 1.4 TiB of memory (uniformly assuming the memory request of mw-web, which is actually a bit higher than the other two due to the larger APCu size).
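
For reference, the per-replica figures implied by those totals work out to roughly 5.2 CPU-s/s and 3.7 GiB each. A quick sketch (approximate, since the real request values live in the deployment charts):

```python
# Per-replica requests implied by the aggregate figures above; the actual
# request values live in the deployment charts, so treat as approximations.
CPU_PER_REPLICA = 2018 / 388              # ~5.2 CPU-s/s
MEM_PER_REPLICA_GIB = 1.4 * 1024 / 388    # ~3.7 GiB (mw-web's request)

def added_demand(extra_replicas: int) -> tuple[float, float]:
    """Return (CPU-s/s, memory in TiB) for a given replica increment."""
    return (extra_replicas * CPU_PER_REPLICA,
            extra_replicas * MEM_PER_REPLICA_GIB / 1024)

print(added_demand(388))  # ~ (2018, 1.4): the 60%-at-p95 scenario
print(added_demand(278))  # ~ (1446, 1.0): the 60%-at-p75 scenario
```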

Per kubernetes-resources, that's totally fine in terms of memory. However, CPU is questionable: during the "real" switchover toward codfw, this would fit (while using a sizable majority of available headroom), but a pre-test scenario where the three large active-active services are depooled in codfw would not fit in eqiad (note: it seems odd that codfw has ~ 1k more allocatable CPU - I've not dug into that yet).

If we shift to a 70% target, then it would just barely fit in eqiad. If we instead pick p75, then it would also fit. FWIW, over this same week, the observed high-percentile worker utilization in eqiad looks like:

deployment=mw-api-ext:  p95=0.696  p75=0.628
deployment=mw-api-int:  p95=0.627  p75=0.566
deployment=mw-web:      p95=0.619  p75=0.514

This suggests that 60% is a reasonable target for p95 in most cases, though mw-api-ext seems to run a bit hotter and closer to 70%.

In any case, this is all to say:

  • We'll definitely need to scale up before the switchover (or a pre-test).
  • We may need to run a bit warmer at high percentiles in order to fit (at least in the pre-test scenario).
  • We can use the simplistic increments described here as a starting point, but will probably want to mix and match (e.g., it's probably not unreasonable to run mw-api-int hotter).

If we do run a pre-test with one or more of these services, that would definitely be valuable - especially as an opportunity to double check the relatively unsophisticated analysis here.

Longer term, we probably want to automate something like this - i.e., if it turns out we are running close to our headroom in the hypothetical event of site loss in either direction, we should know that.
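
For illustration, an automated check could look something like the sketch below. The metric and label names are assumptions, the constants are approximations from the figures above, and in practice this would need to be done per deployment rather than in aggregate.

```python
# Sketch: could a single site absorb the combined (cross-site) load if the
# other site were lost? All metric names, labels, and constants here are
# illustrative assumptions, not a real implementation.
WORKERS_PER_REPLICA = 8
TARGET_UTIL = 0.75
CPU_PER_REPLICA = 5.2  # approx., implied by the aggregate figures earlier

# PromQL one might feed into this (assumed metric/label names):
P95_CROSS_SITE_ACTIVE = (
    'quantile_over_time(0.95,'
    ' sum(phpfpm_processes_total{app="mediawiki", state="active"})[7d:1m])'
)
ALLOCATABLE_CPU = 'sum(kube_node_status_allocatable{resource="cpu"}) by (site)'
REQUESTED_CPU = 'sum(kube_pod_container_resource_requests{resource="cpu"}) by (site)'

def single_site_fits(p95_active_workers: float,
                     site_replicas: int,
                     site_allocatable_cpu: float,
                     site_requested_cpu: float) -> bool:
    """Would this site fit the cross-site load at TARGET_UTIL, considering
    only CPU requests ("if it schedules, it'll be fine")?"""
    needed = p95_active_workers / (WORKERS_PER_REPLICA * TARGET_UTIL)
    extra = max(0.0, needed - site_replicas)
    return extra * CPU_PER_REPLICA <= site_allocatable_cpu - site_requested_cpu
```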

I chatted with @Clement_Goubert a bit earlier, and it sounds like targeting 60% utilization at p95 is probably not necessary - 70% or possibly even 75% should be a fine starting point (the PHPFPMTooBusy alert threshold is the latter).

The same numbers for 70%:

deployment=mw-api-ext:  p75=250 p95=268
deployment=mw-api-int:  p75=228 p95=248
deployment=mw-web:      p75=381 p95=427

(i.e., +62, +0, +125 respectively) and 75%:

deployment=mw-api-ext:  p75=233 p95=250
deployment=mw-api-int:  p75=213 p95=232
deployment=mw-web:      p75=356 p95=399

(i.e., +44, +0, +97 respectively).

If we start with 75% at p95 and adjust as needed, that's 141 additional replicas, which will definitely fit - that's about 733 CPU-s/s, and we have ~ 2500 and 1800 available in codfw (actual switchover target) and eqiad (where we'll capacity test), respectively.
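
Spelled out (with per-replica CPU again approximated from the earlier aggregate figures):

```python
# 75%-at-p95 targets vs. current replica counts (numbers from above).
current = {"mw-api-ext": 206, "mw-api-int": 268, "mw-web": 302}
target = {"mw-api-ext": 250, "mw-api-int": 232, "mw-web": 399}

extra = {d: max(0, target[d] - current[d]) for d in current}
total_extra = sum(extra.values())            # 44 + 0 + 97 = 141

CPU_PER_REPLICA = 2018 / 388                 # ~5.2, implied by the earlier totals
extra_cpu = total_extra * CPU_PER_REPLICA    # ~733 CPU-s/s

print(extra, total_extra, round(extra_cpu))
# {'mw-api-ext': 44, 'mw-api-int': 0, 'mw-web': 97} 141 733
# vs. ~2500 CPU-s/s available in codfw and ~1800 in eqiad
```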

For eqiad, that will push us close to the ~ 1k available mark, which is where deployments may be slowed a bit (h/t to @Clement_Goubert for this threshold). We'll have to keep an eye on that during the test, particularly if we need to scale up further.

As for exactly when the test will happen, one more aspect to consider is the additional demand on WAN links from previously codfw-bound RO traffic flowing to eqiad instead, and whether that conflicts with any planned network maintenance. I chatted with @cmooney about this a bit, and the only planned work next week that would be good to avoid is the codfw row C/D switch migrations happening Tu/W/Th at 16:00 UTC.

Notably, this configuration won't be directly comparable to any phase of the switchover in terms of demand on codfw <> eqiad WAN links (e.g., both codfw and eqiad will remain pooled for CDN traffic), but we can put a loose upper bound on it by simply taking the aggregate total network usage for the respective services. In any case, we'll keep an eye on link saturation during the test.

Alright, following up on this:

I'd propose that we do this on Thursday of this week, during the MediaWiki infrastructure (UTC late) deployment window (17:00 UTC).

Shortly before the window (to be coordinated with @RLazarus to avoid conflicts related to the puppet request window), I'll merge and apply changes to scale mw-api-ext and mw-web in eqiad to the 75% at p95 targets in T371273#10126548 (mw-api-int requires no changes).

Once the window starts, I'll depool mw-api-int-ro, mw-api-ext-ro, and mw-web-ro from codfw in that order (staggered a bit), while monitoring php-fpm worker saturation and codfw <> eqiad WAN link usage.

The point of this exercise is to provide basic validation that the extrapolation above makes sense. Given that, and the fact that it's unlikely we'd manage to catch weekly top-5% load for a given service over the course of a short test, it's probably fine if we repool codfw after only an hour or so.

If we're feeling particularly adventurous (and if the slot is actually going to be used), we could leave things in place through all or part of the MediaWiki secondary train window starting at 18:00. We should be able to fit deployments without issue at these targets, but it may still be a useful exercise to see whether there's any slowdown nonetheless.

Change #1073904 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-(api-ext|web): scale up to 75% at p95 targets

https://gerrit.wikimedia.org/r/1073904

Change #1073904 merged by jenkins-bot:

[operations/deployment-charts@master] mw-(api-ext|web): scale up to 75% at p95 targets

https://gerrit.wikimedia.org/r/1073904

Mentioned in SAL (#wikimedia-operations) [2024-09-19T16:48:33Z] <swfrench-wmf> scaling up mw-api-ext in eqiad for pre-switchover testing - T371273

Mentioned in SAL (#wikimedia-operations) [2024-09-19T16:50:49Z] <swfrench-wmf> scaling up mw-web in eqiad for pre-switchover testing - T371273

Mentioned in SAL (#wikimedia-operations) [2024-09-19T17:02:31Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=mw-api-int-ro,name=codfw [reason: Pre-switchover capacity validation - T371273]

Mentioned in SAL (#wikimedia-operations) [2024-09-19T17:08:50Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=mw-api-ext-ro,name=codfw [reason: Pre-switchover capacity validation - T371273]

Mentioned in SAL (#wikimedia-operations) [2024-09-19T17:17:34Z] <swfrench@cumin2002> conftool action : set/pooled=false; selector: dnsdisc=mw-web-ro,name=codfw [reason: Pre-switchover capacity validation - T371273]

Change #1074232 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] Revert "mw-(api-ext|web): scale up to 75% at p95 targets"

https://gerrit.wikimedia.org/r/1074232

Mentioned in SAL (#wikimedia-operations) [2024-09-19T17:46:19Z] <swfrench@cumin2002> conftool action : set/pooled=true; selector: dnsdisc=mw-api-int-ro,name=codfw [reason: Reverting pre-switchover capacity validation - T371273]

Mentioned in SAL (#wikimedia-operations) [2024-09-19T17:49:30Z] <swfrench@cumin2002> conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro,name=codfw [reason: Reverting pre-switchover capacity validation - T371273]

Mentioned in SAL (#wikimedia-operations) [2024-09-19T17:53:50Z] <swfrench@cumin2002> conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro,name=codfw [reason: Reverting pre-switchover capacity validation - T371273]

Change #1074232 merged by jenkins-bot:

[operations/deployment-charts@master] Revert "mw-(api-ext|web): scale up to 75% at p95 targets"

https://gerrit.wikimedia.org/r/1074232

Mentioned in SAL (#wikimedia-operations) [2024-09-19T18:02:13Z] <swfrench-wmf> scaling down mw-web in eqiad after pre-switchover testing - T371273

Mentioned in SAL (#wikimedia-operations) [2024-09-19T18:02:51Z] <swfrench-wmf> scaling down mw-api-ext in eqiad after pre-switchover testing - T371273

Alright, well that was pleasantly uneventful.

Cutting out the periods while traffic was shifting, the deployments in eqiad were carrying full load during the following time windows (UTC):

  • mw-api-int: 17:08 - 17:46
  • mw-api-ext: 17:15 - 17:53
  • mw-web: 17:21 - 17:54

While the amount of data to work with is limited by the short duration of the test, the distribution of observed utilization looks promising, in that it doesn't seem to wildly contradict what we've extrapolated. Specifically, we saw:

deployment=mw-api-int:  p50=0.44 p75=0.47 p95=0.51
deployment=mw-api-ext:  p50=0.67 p75=0.69 p95=0.73
deployment=mw-web:      p50=0.64 p75=0.65 p95=0.66

vs. what we get if we take the same historical data as before and apply the 75% at p95 replica counts:

deployment=mw-api-int: p50=0.56 p75=0.59 p95=0.65
deployment=mw-api-ext: p50=0.65 p75=0.70 p95=0.75
deployment=mw-web:     p50=0.63 p75=0.67 p95=0.75

while noting that we didn't scale mw-api-int down to the proposed target from its current, very well-provisioned state.
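
For clarity, the projection in that second table is just the same historical worker series rescaled by the proposed replica counts, along the lines of the sketch below (the series variables are placeholders for the weekly 1m-resolution data):

```python
import numpy as np

WORKERS_PER_REPLICA = 8

def projected_utilization(active_workers: np.ndarray, replicas: int) -> dict:
    """Utilization the historical active-worker series would imply if the
    deployment ran at the given replica count."""
    util = active_workers / (replicas * WORKERS_PER_REPLICA)
    return {p: round(float(np.percentile(util, p)), 2) for p in (50, 75, 95)}

# e.g. projected_utilization(mw_web_weekly_series, 399)      # mw-web at its 75%/p95 target
#      projected_utilization(mw_api_int_weekly_series, 268)  # mw-api-int at its *current* count
```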

While these are of course wildly different sample sizes, and the former samples only one specific time of day / week, we did at least conduct the test relatively close to daily (Thursday) peak for eqiad-bound traffic to both mw-web and mw-api-ext.

Looking at latency, it looks like we gained:

  • mw-web: ~ 20ms at p50, 30-40ms at p75, 50-100ms at p99
  • mw-api-ext: 5-10ms at p50, 30-40ms at p75, ?? at p99 (kind of hard to say, given the spikiness of p99)
  • mw-api-int: 5-10ms at p50, 30ms at p75 (spiky), ?? at p99 (hard to pick out a discernible increment)

Interestingly, these are no larger than the latency bumps we see routinely during deployments.

In any case, I think these numbers still seem reasonable to use as a starting point, which we can then augment as needed based on observed utilization / latency throughout the day following the switchover(s).

Nicely done! Thanks for the detailed writeup!

(note: it seems odd that codfw has ~ 1k more allocatable CPU - I've not dug into that yet).

There was an 8-machine discrepancy around that time (it's smaller now that we've decommissioned some nodes), which would explain ~400 CPUs. The rest is probably down to the different hardware generations across the 5 years of machines we have.

Thanks, @akosiaris! That's a good point, and agreed that a different hardware generation should be enough to explain the remaining difference in aggregate core counts.

Alright, since there's nothing else explicitly tracked here, I am going to resolve this.

After the switchover, we should consider revisiting the extrapolation used here, and whether it can / should be used in some way to inform sizing at steady state.