In advance of the September 2024 switchover (T370962), these services were upsized to support serving all traffic from a single DC using the targets from T371273.
In the immediate term, we should decide whether to accept this as the new normal (i.e., being "ready" for single-DC serving at steady state) or to scale back down to multi-DC sizing (i.e., reverting https://gerrit.wikimedia.org/r/1075056).
My vote would be to revert, primarily because the current sizing is inherently a point-in-time estimate, and there are no guardrails to ensure it stays up to date.
On that topic, though, in the medium term, we should figure out how / whether to make the analysis in T371273 repeatable / automated in two contexts:
- quickly / easily identifying single-DC size targets (e.g., in advance of the next switchover, or in the event of an actual site failure); and
- monitoring whether there's sufficient k8s cluster capacity headroom to allow single-DC serving at all.
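As a rough illustration of what a repeatable version of those two checks might compute, here is a minimal sketch. It is not the method used in T371273: it assumes the per-DC peak busy php-fpm worker counts for a service are already known, and that a single DC must absorb their sum at some target utilization. All function names, parameters, and figures are hypothetical.

```python
import math


def single_dc_replicas(peak_busy_workers_by_dc, workers_per_replica, target_utilization):
    """Replicas one DC would need to absorb the combined peak of all DCs.

    Assumption: peaks are simply summed, which is pessimistic if the DCs
    do not peak at the same time.
    """
    total_busy = sum(peak_busy_workers_by_dc.values())
    usable_workers_per_replica = workers_per_replica * target_utilization
    return math.ceil(total_busy / usable_workers_per_replica)


def has_headroom(target_replicas, current_replicas, cpu_per_replica,
                 cluster_requested_cpu, cluster_allocatable_cpu):
    """True if the cluster could fit the extra replicas by CPU requests alone.

    Assumption: CPU requests are the binding resource; memory and
    per-node fragmentation are ignored.
    """
    extra_replicas = max(target_replicas - current_replicas, 0)
    extra_cpu = extra_replicas * cpu_per_replica
    return cluster_requested_cpu + extra_cpu <= cluster_allocatable_cpu


# Hypothetical inputs, for illustration only.
target = single_dc_replicas({"dc-a": 300, "dc-b": 150},
                            workers_per_replica=8,
                            target_utilization=0.5)
fits = has_headroom(target, current_replicas=60, cpu_per_replica=4,
                    cluster_requested_cpu=1000, cluster_allocatable_cpu=1300)
```

Run periodically against live metrics, the first function would cover the "size targets" case and the second the "capacity headroom" case.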
In the longer term, this probably ties in with any plans for custom-metric-driven HPA (i.e., using php-fpm active worker counts). For example:
- Would we want to be able to easily switch the metric from multi- to single-DC mode? (e.g., in advance of a switchover)
- In a future where DNS is not the underlying mechanism for DC-level load-balancing, would we be able to ramp traffic in a sufficiently controlled way that HPA can do the work for us? (setting aside emergencies where we don't have that luxury)
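To make the first question concrete: the core scaling rule documented for the Kubernetes HPA is desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change while the ratio is within a tolerance (0.1 by default). The sketch below models only that rule; it ignores min/max bounds, stabilization windows, and missing-metric handling, and the per-pod targets for active php-fpm workers are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """Core HPA rule: scale in proportion to current/target, within a tolerance."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: the HPA leaves the replica count alone.
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical: 40 replicas averaging 6 active workers per pod.
multi_dc_mode = hpa_desired_replicas(40, current_metric=6.0, target_metric=6.0)
# Halving the per-pod target ("single-DC mode") doubles the replica count
# before any extra traffic arrives.
single_dc_mode = hpa_desired_replicas(40, current_metric=6.0, target_metric=3.0)
```

Under this model, switching modes ahead of a switchover amounts to lowering the target value, so the HPA pre-scales to leave room for the incoming traffic.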