2024-04-17 mw-on-k8s eqiad outage
Closed, Resolved (Public)

Authored By

	jcrespo
	Apr 17 2024, 11:22 AM

Description

In mw-on-k8s in eqiad, we switched mcrouter's location to mcrouter-main.mw-mcrouter.svc.cluster.local:4442 instead of 127.0.0.1:11213. The same change was deployed on codfw the day before, but codfw has less traffic.

This change increased the number of DNS requests towards CoreDNS, from an average of 40k req/s to 110k req/s, overwhelming the pods.

Status at ~09:20 UTC:

scap was blocked waiting for the mw-on-k8s deployment to finish
during the deployment, the mediawiki pods never became ready, and after a while scap attempted to roll back
CoreDNS pods (3) were overwhelmed and OOM-killed over and over again (being left in a CrashLoopBackOff state)

Actions:

depooled mediawiki reads from eqiad (via discovery)
increased memory limits and replicas for coredns on wikikube clusters
terminated the mw-server FQDN with a dot: mcrouter-main.mw-mcrouter.svc.cluster.local.:4442
reverted eqiad to use the in-pod mcrouter container
pooled eqiad back
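
As a rough illustration of why the trailing dot matters, the sketch below models how a glibc-style stub resolver expands a name through its search list. It is not taken from the actual pod configuration: the search domains, the "example-ns" namespace, and ndots=5 (the usual Kubernetes default for cluster DNS) are assumptions, and candidate_queries is a hypothetical helper.

```python
def candidate_queries(name, search_domains, ndots=5):
    """Return the DNS names a glibc-style stub resolver would try, in order."""
    # A trailing dot marks the name as absolute: the search list is skipped.
    if name.endswith("."):
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    # Names with at least `ndots` dots are tried as-is first; names with
    # fewer dots go through the search list first and are tried as-is last.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


# Assumed search list for a pod in a hypothetical "example-ns" namespace.
SEARCH = ["example-ns.svc.cluster.local", "svc.cluster.local", "cluster.local"]
HOST = "mcrouter-main.mw-mcrouter.svc.cluster.local"

relative = candidate_queries(HOST, SEARCH)
absolute = candidate_queries(HOST + ".", SEARCH)

for query in relative:
    print("without trailing dot:", query)
for query in absolute:
    print("with trailing dot:   ", query)
```

Under these assumptions the service name has four dots, fewer than ndots, so the resolver walks the whole search list before trying the name itself, while the dot-terminated form is looked up exactly once. Real resolvers may also issue separate A and AAAA queries per candidate, which this sketch does not model.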

Impact:
For approximately 39 minutes (09:15-09:54 UTC), the primary DC was either unavailable (returning 5XX HTTP errors) or degraded, with increased latencies for reads and writes. This impacted most edits and uncached read requests routed to eqiad (those not in the Americas). From HAProxy's point of view (edge), non-5XX requests (text+upload) dropped from 138K rps to 120K-130K rps for 20 minutes.

Incident Report: 2024-04-17 mw-on-k8s eqiad outage

Related Objects

Status	Assigned	Task
In Progress	None	T290536 Serve production traffic via Kubernetes
Resolved	jijiki	T277711 Memcached, mcrouter in MediaWiki on Kubernetes
Resolved	Joe	T278220 Define the size of a pod for mediawiki in terms of resource usage
Resolved	jijiki	T346690 mw-mcrouter daemonset on mw-on-k8s
Resolved	jijiki	T362766 2024-04-17 mw-on-k8s eqiad outage
Invalid	None	T363186 Cache mw-mcrouter service ClusterIP in apcu cache

Event Timeline

jcrespo created this task. (Apr 17 2024, 11:22 AM)

Restricted Application removed a project: SRE. (Apr 17 2024, 11:22 AM)

Restricted Application added a subscriber: Aklapper.

akosiaris subscribed. (Apr 17 2024, 11:31 AM)

The change was rolled back in eqiad, and eqiad was repooled around 10:45. A terminating dot was added to the DNS name in codfw to avoid a recursive request.

LSobanski subscribed. (Apr 17 2024, 1:17 PM)

As an aside, and contributing to the time to recovery, we observed the apache container getting OOM-killed; we strongly suspect this was caused by backpressure from the php-fpm workers being busy waiting for DNS responses.

JMeybohm updated the task description. (Apr 18 2024, 12:16 PM)

Krinkle subscribed. (Apr 18 2024, 11:05 PM)

jijiki closed this task as Resolved. (Apr 25 2024, 1:48 PM)

jijiki claimed this task.

jijiki updated the task description.

jijiki renamed this task from "2024-04-17 mw-* went down in eqiad" to "2024-04-17 mw-on-k8s eqiad outage". (Apr 25 2024, 1:56 PM)

jijiki added a parent task: T346690: mw-mcrouter daemonset on mw-on-k8s. (Jun 5 2024, 11:36 AM)

jijiki closed subtask T363186: Cache mw-mcrouter service ClusterIP in apcu cache as Invalid. (Jul 16 2024, 3:09 PM)