Shortly after 18:30 today, mw-api-ext (eqiad) alerted for probe failures (page), php-fpm worker saturation, and elevated latency.
@CDanis prepared a timeline from the IRC log here: https://etherpad.wikimedia.org/p/2024-05-24
There are some interesting observations in there, including:
- Unusually high tx from a single memcache task in the 18:28-18:30 range, possibly due to a hot key (see the sketch after this list)
- OOM kills in the ~ 18:35 range, affecting both mw-api-ext and mw-jobrunner
- mwlog1002 alerted for packet loss during that window, possibly due to a large volume of logs
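For the hot-key hypothesis, a minimal sketch of how per-task tx could be pulled from Prometheus and ranked. The Thanos endpoint, metric name, and namespace label here are placeholders, not checked against our stack:

```
# Hedged sketch: rank memcache tasks by peak network tx over the incident
# window. Endpoint URL, metric, and label names below are assumptions.
import requests

PROM = "https://thanos-query.example.org/api/v1/query_range"  # placeholder

resp = requests.get(PROM, params={
    # per-task transmit rate at 1m resolution across the incident window
    "query": 'rate(container_network_transmit_bytes_total{namespace="memcached"}[1m])',
    "start": "2024-05-24T18:20:00Z",
    "end": "2024-05-24T18:45:00Z",
    "step": "60s",
}, timeout=30)
resp.raise_for_status()
series = resp.json()["data"]["result"]

def peak_tx(s):
    # Prometheus matrix values are [timestamp, "value-as-string"] pairs
    return max(float(v) for _, v in s["values"])

# A single task far above the rest supports the hot-key reading.
for s in sorted(series, key=peak_tx, reverse=True)[:5]:
    print(f"{peak_tx(s) / 1e6:8.1f} MB/s  {s['metric'].get('pod', s['metric'])}")
```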
We also see unusually high network tx from mw-api-ext in the 18:35-18:40 range (https://grafana.wikimedia.org/goto/jcAq9aPIg?orgId=1). Once sflow ingestion is done, that data might help identify what this traffic is. It doesn't look like memcache traffic (there's no corresponding rx jump), but it might be log volume.
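As a rough check on the "no corresponding rx jump" reasoning: memcache fetches should show rx roughly tracking tx, while locally generated bytes (e.g. logs) would not. A sketch assuming the same placeholder endpoint and cadvisor-style metrics as above; the namespace label is a guess, and the linked Grafana panel has the real query:

```
# Hedged sketch: compare mw-api-ext tx vs rx minute by minute. If tx spikes
# while rx stays flat, the extra bytes originate locally (e.g. log volume)
# rather than being pulled from a cache. Metric/label names are assumptions.
import requests

PROM = "https://thanos-query.example.org/api/v1/query_range"  # placeholder

def fetch(query):
    r = requests.get(PROM, params={
        "query": query,
        "start": "2024-05-24T18:25:00Z",
        "end": "2024-05-24T18:45:00Z",
        "step": "60s",
    }, timeout=30)
    r.raise_for_status()
    # sum() yields a single series; return its [timestamp, value] pairs
    return r.json()["data"]["result"][0]["values"]

tx = fetch('sum(rate(container_network_transmit_bytes_total{namespace="mw-api-ext"}[1m]))')
rx = fetch('sum(rate(container_network_receive_bytes_total{namespace="mw-api-ext"}[1m]))')

for (ts, t_val), (_, r_val) in zip(tx, rx):
    print(f"{ts}  tx={float(t_val) / 1e6:8.1f} MB/s  rx={float(r_val) / 1e6:8.1f} MB/s")
```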
One other potentially useful observation is a 5x bump in write ops to s4 (commons) correlated with the start of slowness (https://grafana.wikimedia.org/goto/fUvhQ-ESR?orgId=1). I say "potentially" here, as this doesn't seem to be a terribly unusual occurrence on its own.
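To put a number on that bump, one could pull write-command rates for s4 from the mysqld exporter. The shard-selection label (group="s4") and endpoint are guesses; the linked dashboard has the real ones:

```
# Hedged sketch: baseline vs peak write-command rates on s4 over the window.
# The group="s4" selector and endpoint are assumptions about our labeling.
import requests

PROM = "https://thanos-query.example.org/api/v1/query_range"  # placeholder

query = (
    'sum by (command) ('
    'rate(mysql_global_status_commands_total'
    '{command=~"insert|update|delete", group="s4"}[1m]))'
)
resp = requests.get(PROM, params={
    "query": query,
    "start": "2024-05-24T18:20:00Z",  # a few minutes of pre-incident baseline
    "end": "2024-05-24T18:45:00Z",
    "step": "60s",
}, timeout=30)
resp.raise_for_status()

for s in resp.json()["data"]["result"]:
    vals = [float(v) for _, v in s["values"]]
    print(f"{s['metric']['command']:8s} baseline={vals[0]:8.1f}/s  peak={max(vals):8.1f}/s")
```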
PHP slow-log events for mw-api-ext clearly ramp up in the 18:33 minute (https://logstash.wikimedia.org/app/dashboards#/view/74557260-a88f-11ed-96bb-4b4732aa077a?_g=h@cc7791f&_a=h@af8cf1d), but it's hard to separate cause from effect (there are many slow DB queries, but no single consistent culprit code path).
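One way to chip at the cause-vs-effect question would be to bucket slow-log events per minute and per entry point, so a dominant code path would stand out if there is one. A sketch against the Elasticsearch API behind logstash; the endpoint, index pattern, and field/channel names are all assumptions:

```
# Hedged sketch: slow-log events per minute, with the top request paths in
# each bucket. Endpoint, index pattern, and field/channel names are assumed.
import requests

ES = "https://logstash.example.org:9200/logstash-mediawiki-*/_search"  # placeholder

body = {
    "size": 0,
    "query": {"bool": {"filter": [
        {"term": {"channel": "slowlog"}},  # assumed channel name
        {"range": {"@timestamp": {
            "gte": "2024-05-24T18:25:00Z",
            "lte": "2024-05-24T18:45:00Z",
        }}},
    ]}},
    "aggs": {"per_minute": {
        "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
        "aggs": {"by_path": {"terms": {"field": "url.keyword", "size": 5}}},
    }},
}
resp = requests.post(ES, json=body, timeout=30)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["per_minute"]["buckets"]:
    top = [b["key"] for b in bucket["by_path"]["buckets"]]
    print(bucket["key_as_string"], bucket["doc_count"], top)
```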
I'll check back once turnilo has sflow data for the relevant window.
In any case, I'm opening this to get everything written down in one place, particularly in case folks have seen this kind of event before.