Steady-state sizing of mw-web and mw-api-ext
Open, Needs Triage, Public

Description

In advance of the September 2024 switchover (T370962), these services were upsized to support serving all traffic from a single DC using the targets from T371273.

In the immediate term, we should determine whether to either accept this as the new normal (i.e., being "ready" for single-DC serving at steady state) or scale back down to multi-DC sizing (i.e., reverting https://gerrit.wikimedia.org/r/1075056).

My vote would be to revert, primarily because the current sizing is inherently a point-in-time estimate, and there are no guardrails to ensure it stays up to date.

On that topic, though, in the medium term, we should figure out how / whether to make the analysis in T371273 repeatable / automated in two contexts:

  1. quickly / easily identifying single-DC size targets (e.g., in advance of the next switchover, or in the event of an actual site failure); and
  2. monitoring whether there's sufficient k8s cluster capacity headroom to allow single-DC serving at all.
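To make the two contexts above concrete, here is a minimal sketch of what a repeatable version of the calculation might look like. It is not the method used in T371273: the function names, the inputs (peak busy php-fpm workers per DC, workers per pod, a target utilization), and every number are assumptions for illustration only.

```python
import math


def single_dc_replicas(peak_busy_workers_by_dc, workers_per_pod, target_utilization):
    """Replicas one DC would need to absorb the combined peak of all DCs.

    Hypothetical model: sum the peak busy-worker counts observed in each DC,
    then size so that pods run at `target_utilization` of their workers.
    """
    total_busy = sum(peak_busy_workers_by_dc.values())
    return math.ceil(total_busy / (workers_per_pod * target_utilization))


def has_headroom(replicas, cpu_per_pod, allocatable_cpu, other_requested_cpu):
    """True if the cluster could schedule `replicas` pods of this service.

    `other_requested_cpu` is the CPU requested by everything else in the
    cluster, i.e. excluding the service being sized.
    """
    return replicas * cpu_per_pod <= allocatable_cpu - other_requested_cpu


# Made-up inputs, purely illustrative.
peaks = {"dc_a": 1200, "dc_b": 900}
target = single_dc_replicas(peaks, workers_per_pod=8, target_utilization=0.6)
fits = has_headroom(target, cpu_per_pod=4, allocatable_cpu=3000, other_requested_cpu=1000)
```

Run periodically against real metrics, the first function would cover context 1 and the second could back an alert for context 2.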

In the longer term, this probably ties in with any plans for custom-metric-driven HPA (i.e., using php-fpm active worker counts). For example:

  • Would we want to be able to easily switch the metric from multi- to single-DC mode? (e.g., in advance of a switchover)
  • In a future where DNS is not the underlying mechanism for DC-level load-balancing, would we be able to ramp traffic in a sufficiently controlled way that HPA can do the work for us? (setting aside emergencies where we don't have that luxury)
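As a rough illustration of the first question, the sketch below applies the replica calculation documented for the Kubernetes HPA (current replicas scaled by the ratio of observed to target metric value, with a default tolerance of 0.1) to two hypothetical per-pod targets. The mode names and target values are assumptions, not existing configuration.

```python
import math


def hpa_desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Desired replica count following the documented HPA formula:
    ceil(currentReplicas * currentMetricValue / desiredMetricValue),
    with no change when the ratio is within `tolerance` of 1.0.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical targets for average active php-fpm workers per pod.
TARGETS = {"multi-dc": 6.0, "single-dc": 3.0}

# Same observed load (5.5 active workers per pod on 100 replicas), two modes.
multi = hpa_desired_replicas(100, 5.5, TARGETS["multi-dc"])
single = hpa_desired_replicas(100, 5.5, TARGETS["single-dc"])
```

With these made-up numbers, the multi-DC target leaves the deployment unchanged, while switching to the lower single-DC target makes the HPA scale out ahead of the traffic shift. It only shows how changing the target moves the computed replica count; how and when to switch back is left open.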

Event Timeline

I vote we revert as well. We could leave the single-DC values and a link to T371273 as comments in the config, so we have a reference point and a quick starting point for an unplanned switch.

+1 to reverting and leaving the pointer to T371273 in the comments, for the already mentioned reasons.

In the immediate term, we should determine whether to either accept this as the new normal (i.e., being "ready" for single-DC serving at steady state) or scale back down to multi-DC sizing

There is one more option that lies somewhere between the two (single vs. multi): we budget our resources in such a way that, in an emergency, we can offer users and our systems an acceptable experience[0] without having to immediately fire up a deployment to scale things up. I understand that there are multiple indicators (latency, worker saturation, etc.) as well as other variables (on/off-peak times). However, if we pick and choose at least some that make sense to us, it might get us to a more 'ready for the worst' status.

make the analysis in T371273 repeatable

I absolutely agree that we could make this capacity assessment part of the switchover.

[0] https://wikitech.wikimedia.org/wiki/Performance/Real_user_monitoring. Disclaimer: I do not know how this data is collected or whether this work is still maintained.

Thanks, all, for weighing in!

+1 to leaving the "valid as of September 2024" sizes around in the values file, commented out with details on when / where they came from.

At a minimum, adopting that strategy going forward would give us a twice-per-year re-calibration of those values, and thus at least some confidence that they approximate what we'll need in an unplanned switchover, while at the same time acting as a reminder that they may be stale.

@jijiki - I like your idea of identifying a steady-state size that should allow us to carry full load in a single DC without pre-scaling in an emergency, at least on a temporary basis. In an ideal world, we could justify this with an error budget - e.g., something that should be fine when off-peak, but might burn some acceptably small fraction of budget at peak (until a human, or perhaps HPA in the future, can intervene and upsize).
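To sketch what "an acceptably small fraction of budget" could mean in numbers: the snippet below estimates how much of an error budget would be burned by running undersized at peak until someone intervenes. No such SLO exists in this thread; the SLO target, request volumes, and error rate are all hypothetical.

```python
def budget_fraction_burned(peak_rps, degraded_seconds, degraded_error_rate,
                           window_requests, slo_target):
    """Fraction of the error budget consumed by one degraded period.

    bad requests   = peak_rps * degraded_seconds * degraded_error_rate
    allowed errors = window_requests * (1 - slo_target)
    """
    bad = peak_rps * degraded_seconds * degraded_error_rate
    allowed = window_requests * (1.0 - slo_target)
    return bad / allowed


# Made-up scenario: 30 minutes at peak with 1% errors, against a
# hypothetical 99.9% SLO over a window of 20 billion requests.
burned = budget_fraction_burned(
    peak_rps=10_000,
    degraded_seconds=1800,
    degraded_error_rate=0.01,
    window_requests=20_000_000_000,
    slo_target=0.999,
)
```

In this made-up scenario the degraded half hour would consume a bit under 1% of the budget; a candidate steady-state size could be judged by whether that fraction stays acceptable at peak.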

I'll give this some thought - i.e., how we might approach computing and validating this, and working it into the periodic / automated form of the analysis.

In the immediate future, I think I'll probably start by going with the revert / comment option.

Change #1078481 had a related patch set uploaded (by Scott French; author: Scott French):

[operations/deployment-charts@master] mw-(web|api-ext): revert to multi-DC sizing

https://gerrit.wikimedia.org/r/1078481

In an ideal world, this process would inform the upper and lower bounds of an HPA, and we wouldn't need to come up with exact numbers, but could rather rely on some metric (e.g. PHP idle workers) and let the system balance itself. Even then, we would still need to run the process to update those bounds every now and then, because the world (and our traffic) isn't set in stone. So automating it a bit would be worth it.

While I like the SLO / error budget approach, that's a long way off: we barely have one MediaWiki-related SLO (the edit check one), and we'd need many more before that's an option.

+1 to revert.