
2024-04-17 mw-on-k8s eqiad outage
Closed, Resolved (Public)

Description

In mw-on-k8s in eqiad, we switched mcrouter's address from 127.0.0.1:11213 (the in-pod mcrouter container) to mcrouter-main.mw-mcrouter.svc.cluster.local:4442. The same change had been deployed in codfw the day before, but codfw has less traffic.

This change increased the number of DNS requests to CoreDNS from an average of 40k req/s to 110k req/s, overwhelming the CoreDNS pods.

Status at ~09:20 UTC:

  • scap was blocked waiting for the mw-on-k8s deployment to finish
  • during the deployment, the mediawiki pods never became ready, and after a while scap attempted to roll back
  • the CoreDNS pods (3) were overwhelmed and OOM-killed over and over again, leaving them in a CrashLoopBackOff state

Actions:

  • depooled mediawiki reads from eqiad (via discovery)
  • increased memory limits and replicas for coredns on the wikikube clusters
  • terminated the mw-server FQDN with a trailing dot: mcrouter-main.mw-mcrouter.svc.cluster.local.:4442
  • reverted eqiad to use the in-pod mcrouter container
  • pooled eqiad back
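
The trailing dot in the actions above matters because a stub resolver treats a dot-terminated name as fully qualified and skips its search list. The following is a minimal sketch of that expansion, not the configuration used in the incident: it assumes a glibc-style resolver, the common Kubernetes default of ndots:5, and a search list for a hypothetical namespace called mw-example. The function name candidate_queries is also made up for illustration.

```python
def candidate_queries(name, search, ndots=5):
    """Return the names a glibc-style stub resolver would try, in order."""
    # A trailing dot marks the name as fully qualified: the search list is skipped.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = name + "."
    # With fewer than `ndots` dots, the search list is tried before the name itself.
    if name.count(".") < ndots:
        return expanded + [absolute]
    return [absolute] + expanded


# Assumed search list for a pod in a hypothetical "mw-example" namespace.
SEARCH = ["mw-example.svc.cluster.local", "svc.cluster.local", "cluster.local"]

relative = candidate_queries("mcrouter-main.mw-mcrouter.svc.cluster.local", SEARCH)
absolute = candidate_queries("mcrouter-main.mw-mcrouter.svc.cluster.local.", SEARCH)

for query in relative:
    print("without dot:", query)
for query in absolute:
    print("with dot:   ", query)
```

Under these assumptions the undotted name has four dots, which is below ndots, so the resolver walks the whole search list before it tries the name as written. The dotted form produces a single query.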

Impact:
For approximately 39 minutes, 09:15-09:54 UTC, the primary DC was either unavailable (returning HTTP 5XX errors) or degraded, with increased latencies for reads and writes. This affected most edits and the uncached read requests routed to eqiad (those not in the Americas). From HAProxy's point of view (edge), non-5XX requests (text+upload) dropped from 138K rps to 120K-130K rps for 20 minutes.

Incident Report: 2024-04-17 mw-on-k8s eqiad outage

Event Timeline


The change was rolled back in eqiad, and eqiad was repooled around 10:45. A terminating dot was added to the DNS name in codfw to avoid a recursive request.

As an aside, and contributing to the time to recovery, we observed the apache container getting OOM-killed. We strongly suspect this was caused by backpressure from the php-fpm workers, which were busy waiting for DNS responses.

jijiki claimed this task.
jijiki updated the task description.
jijiki renamed this task from "2024-04-17 mw-* went down in eqiad" to "2024-04-17 mw-on-k8s eqiad outage". (Apr 25 2024, 1:56 PM)