Incident Report: 2024-04-17 mw-on-k8s eqiad outage

In mw-on-k8s in eqiad, we switched mcrouter's address from 127.0.0.1:11213 to mcrouter-main.mw-mcrouter.svc.cluster.local:4442. The same change had been deployed in codfw the day before, but codfw receives less traffic.
This change increased the number of DNS requests towards CoreDNS from an average of 40k req/s to 110k req/s, overwhelming the CoreDNS pods.
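The multiplication comes from the resolver search path inside the pods. Kubernetes normally generates pod resolv.conf files with ndots:5 and several cluster search domains; the sketch below illustrates those defaults and is not copied from the wikikube clusters:

```
# Illustrative pod /etc/resolv.conf (typical Kubernetes defaults, not the actual wikikube values)
search <pod-namespace>.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10   # example cluster DNS service IP
options ndots:5
```

Because mcrouter-main.mw-mcrouter.svc.cluster.local contains only four dots (fewer than ndots:5), the resolver treats it as a relative name: it appends each search domain in turn before trying the bare name, and each attempt is issued as both an A and an AAAA query. A single lookup can therefore fan out into roughly eight DNS queries, most of them NXDOMAIN, whereas the old 127.0.0.1:11213 address required none.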
Status at ~09:20 UTC:
- scap was blocked waiting for the deployment of mw-on-k8s to finish
- during the deployment, the MediaWiki pods never became ready, and after a while scap attempted to roll back
- the CoreDNS pods (3 replicas) were overwhelmed and repeatedly OOM-killed, leaving them in a CrashLoopBackOff state (see the diagnostic sketch after this list)
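As a side note, this is roughly how that state could be confirmed with kubectl; the namespace and label selector below are assumptions, not taken from the incident:

```
# Hypothetical diagnostic commands; namespace and label selector are assumptions
kubectl -n kube-system get pods -l k8s-app=kube-dns      # STATUS column shows CrashLoopBackOff
kubectl -n kube-system describe pod <coredns-pod-name>   # Last State: Terminated, Reason: OOMKilled
```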
Actions:
- depooled MediaWiki reads from eqiad (via discovery)
- increased memory limits and replicas for CoreDNS on the wikikube clusters (see the scaling sketch after this list)
- terminated the mcrouter server FQDN with a trailing dot (mcrouter-main.mw-mcrouter.svc.cluster.local.:4442) so that pod resolvers treat it as fully qualified and skip search-domain expansion (see the before/after sketch after this list)
- reverted eqiad to use the in-pod mcrouter container (127.0.0.1:11213)
- pooled eqiad back
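Two of the mitigations above can be sketched concretely. Appending the trailing dot makes the name absolute, so the pod resolver skips the search-domain expansion described earlier and issues a single lookup per record type:

```
# Before: relative name, expanded against every search domain
mcrouter-main.mw-mcrouter.svc.cluster.local:4442
# After: absolute name (trailing dot), resolved directly
mcrouter-main.mw-mcrouter.svc.cluster.local.:4442
```

For the CoreDNS capacity bump, something along these lines would do it; the deployment name, namespace, replica count, and memory values here are illustrative assumptions, not the settings used on the wikikube clusters:

```
# Hypothetical sketch of scaling CoreDNS and raising its memory limit
kubectl -n kube-system scale deployment coredns --replicas=6
kubectl -n kube-system set resources deployment coredns \
  --requests=memory=256Mi --limits=memory=512Mi
```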
Impact:
For approximately 39 minutes (09:15-09:54 UTC), the primary DC was either unavailable (returning 5XX HTTP errors) or degraded with increased latencies for reads and writes, impacting most edits and the uncached read requests routed to eqiad (i.e. those not originating in the Americas). From HAProxy's point of view at the edge, non-5XX requests (text + upload) dropped from 138K rps to 120K-130K rps for about 20 minutes.