
mw-api-ext unavailability 2024-05-22 18:30 UTC
Open, Needs Triage, Public

Description

Shortly after 18:30 today, mw-api-ext (eqiad) alerted for probe failures (page), php-fpm worker saturation, and latency.

@CDanis prepared a timeline from the IRC log here: https://etherpad.wikimedia.org/p/2024-05-24

There are some interesting observations in there, including:

  • Unusually high tx from a single memcache task in the 18:28 - 18:30 range (e.g., due to a hot key)
  • OOM kills in the ~ 18:35 range, affecting both mw-api-ext and mw-jobrunner
  • mwlog1002 alerted for packet loss during that window, possibly due to large volume of logs

We also see unusually high network tx from mw-api-ext in the 18:35-18:40 range (https://grafana.wikimedia.org/goto/jcAq9aPIg?orgId=1). Once sflow ingestion is done, that data might help identify what behavior is involved. It doesn't look like memcache traffic (there's no corresponding rx jump), but it might be log volume.

One other potentially useful observation is a 5x bump in write ops to s4 (commons) correlated with the start of slowness (https://grafana.wikimedia.org/goto/fUvhQ-ESR?orgId=1). I say "potentially" here, as this doesn't seem to be a terribly unusual occurrence on its own.

PHP slow-log events for mw-api-ext clearly kick up in the 18:33 minute (https://logstash.wikimedia.org/app/dashboards#/view/74557260-a88f-11ed-96bb-4b4732aa077a?_g=h@cc7791f&_a=h@af8cf1d), but it's challenging to discern cause from effect (e.g., there are many slow DB queries, but no consistent culprit code path).

I'll check back once turnilo has sflow data for the relevant window.

In any case, opening this to get everything written down in one place, particularly in case folks have seen this kind of event previously.

Event Timeline

The sflow data is now available in turnilo.

The high network tx from mw-api-ext does indeed appear directed to mwlog1002 (https://w.wiki/AAA2).

Specifically, the peak around ~ 18:35 matches up with that seen for mw-api-ext, while the "shoulder" extending through 18:55 is coming from mw-jobrunner (https://grafana.wikimedia.org/goto/8I64mfESR?orgId=1).

Selecting mwlog1002 as dst and spot-checking top src IPs confirms these are pod IPs.

To avoid query timeouts in logstash, I picked an arbitrary 10s slice of mediawiki logs from mw-jobrunner (18:40:00 - 18:40:10); I'm seeing ~ 2.6M events in that window (https://logstash.wikimedia.org/app/dashboards#/view/AXFV7JE83bOlOASGccsT?_g=h@16abfda&_a=h@e892a70).

Effectively all of those are WARNINGs with messages like Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks: found writes pending. ~ 2.3M of those are for commonswiki, ~ 280k for enwiki, ...

Spot-checking some of them, every one I've looked at is hitting Wikimedia\RequestTimeout\EmergencyTimeoutException and in turn triggering the MWExceptionHandler::rollbackPrimaryChangesAndLog code path at [0], which then winds up emitting these logs at [1].

Unfortunately, I can't find any logs that emit the actual EmergencyTimeoutException message, which would tell us things like the timeout used.

[0] https://gerrit.wikimedia.org/g/mediawiki/extensions/EventBus/+/ab3eabb5c9421864eb5d5c5ad0d4447a2855e4d6/includes/JobExecutor.php#117

[1] https://gerrit.wikimedia.org/g/mediawiki/core/+/658e6437a8e299e222f6546f91504f03fe2f5573/includes/libs/rdbms/loadbalancer/LoadBalancer.php#1534
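
To make that chain concrete, here's a heavily simplified, hypothetical sketch of the path described above. The real implementations are at [0] and [1]; the function bodies and the $runJob callable below are stand-ins of mine, not MediaWiki's actual code.

```php
<?php
// Hypothetical sketch only: approximates the JobExecutor catch block at [0]
// and the LoadBalancer warning at [1]; not the actual MediaWiki code.

use Wikimedia\RequestTimeout\EmergencyTimeoutException;

function executeJob( callable $runJob ): void {
    try {
        // The job blows past the emergency request-timeout limit while
        // primary-DB writes are still uncommitted.
        $runJob();
    } catch ( EmergencyTimeoutException $e ) {
        // The catch block delegates to the equivalent of
        // MWExceptionHandler::rollbackPrimaryChangesAndLog() [0], which rolls
        // back open primary transactions and logs the exception.
        rollbackPrimaryChangesAndLog( $e );
    }
}

function rollbackPrimaryChangesAndLog( Throwable $e ): void {
    // Stand-in for the MediaWiki handler. While the rollback/idle-callback
    // machinery runs, LoadBalancer still sees the pending writes and emits the
    // WARNING observed in bulk on mwlog1002 [1]:
    //   Wikimedia\Rdbms\LoadBalancer::runPrimaryTransactionIdleCallbacks:
    //   found writes pending.
    trigger_error( get_class( $e ) . ': rolling back pending primary writes', E_USER_WARNING );
}
```

The point of the sketch is just the ordering: the timeout fires first, and the "found writes pending" warnings are a downstream symptom of the rollback path rather than an independent failure.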

Change #1049182 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mediawiki: Introduce rsyslog udp2log rate limiting

https://gerrit.wikimedia.org/r/1049182

Change #1049182 merged by jenkins-bot:

[operations/deployment-charts@master] mediawiki: Introduce rsyslog udp2log rate limiting

https://gerrit.wikimedia.org/r/1049182

Change #1049532 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Rate limit udp2log

https://gerrit.wikimedia.org/r/1049532

Change #1049532 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: Rate limit udp2log

https://gerrit.wikimedia.org/r/1049532

Mentioned in SAL (#wikimedia-operations) [2024-06-25T12:46:28Z] <cgoubert@deploy1002> Started scap: Deploy udp2log rate-limiting - T365655 - T368098

Mentioned in SAL (#wikimedia-operations) [2024-06-25T12:51:40Z] <cgoubert@deploy1002> Finished scap: Deploy udp2log rate-limiting - T365655 - T368098 (duration: 05m 49s)

mw-on-k8s pods should now rate limit the logs sent to mwlog at 100 messages per second per pod.
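
For context on what that looks like, rsyslog inputs support built-in rate limiting via ratelimit.interval / ratelimit.burst. The snippet below is an illustrative sketch only; the input type and port are assumptions on my part, and the authoritative configuration is in the two deployment-charts patches above.

```
# Illustrative sketch only; see the merged deployment-charts patches above for
# the real configuration. ratelimit.interval is the measurement window in
# seconds and ratelimit.burst is the max messages allowed per window, so the
# settings below work out to ~100 messages/second. The port is a placeholder.
module(load="imudp")
input(type="imudp" port="10514" ratelimit.interval="1" ratelimit.burst="100")
```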