In mw-on-k8s in eqiad, we switched mcrouter's location from the in-pod 127.0.0.1:11213 to the Kubernetes cluster-wide mcrouter-main.mw-mcrouter.svc.cluster.local:4442 (T346690). The same change was deployed on codfw the day before, but codfw has less traffic.
This change makes MediaWiki perform far more DNS resolutions than before: requests towards CoreDNS rose from an average of 40k req/s to 110k req/s. The capacity allocated to CoreDNS was not enough to handle that, and the pods were overwhelmed. We depooled eqiad, then bumped CoreDNS in codfw and eqiad.
**Status at ~09:20 UTC**:
* scap was blocked waiting for the deployment of mw-on-k8s to finish
* during the deployment, the MediaWiki pods never became ready, and after a while scap attempted to roll back
* the 3 CoreDNS pods were overwhelmed and OOM-killed over and over again (being left in a CrashLoopBackOff state)
**Actions:**
* depooled mediawiki reads from eqiad (via discovery)
* increased memory limits and replicas for CoreDNS on the wikikube clusters
* terminated the mw-server FQDN with a trailing dot: mcrouter-main.mw-mcrouter.svc.cluster.local.:4442
* reverted eqiad to use in-pod mcrouter container
* pooled eqiad back
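
The trailing-dot action above can be illustrated with a minimal sketch of how a glibc-style stub resolver expands a name before querying DNS. This is a simplified model, not the resolver's actual code. The search list and `ndots:5` are the values Kubernetes typically writes into a pod's resolv.conf, and the `mw-web` namespace is hypothetical; the real wikikube settings may differ.

```python
def candidate_queries(name, search, ndots=5):
    """Return the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # Trailing dot: the name is absolute, so the search list is skipped.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [absolute] + expanded
    # Too few dots: walk the search list first, the name as-is last.
    return expanded + [absolute]


# Assumed values: typical pod resolv.conf search list, hypothetical namespace.
SEARCH = ["mw-web.svc.cluster.local", "svc.cluster.local", "cluster.local"]
HOST = "mcrouter-main.mw-mcrouter.svc.cluster.local"

for query in candidate_queries(HOST, SEARCH):
    print(query)

print(len(candidate_queries(HOST, SEARCH)))        # 4 candidate names
print(len(candidate_queries(HOST + ".", SEARCH)))  # 1 candidate name
```

Under these assumptions the undotted name has only 4 dots, so the three search-domain variants are tried (and fail) before the real name is queried. The dotted form goes straight to the real name, which is why it reduces load on CoreDNS.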
**Impact:**
For approximately 39 minutes (09:15-09:54 UTC), the primary DC was unavailable (returning 5XX HTTP errors) or degraded with increased latencies for writes or reads, impacting most edits and uncached read requests routed to eqiad (those not in the Americas). From HAProxy's point of view (edge), non-5XX requests (text+upload) dropped from 138K rps to 120K-130K rps for 20 minutes.
**Incident Report:** [[ https://wikitech.wikimedia.org/wiki/Incidents/2024-04-17_mw-on-k8s_eqiad_outage | 2024-04-17 mw-on-k8s eqiad outage ]]