MediaWiki monolog doesn't handle Kafka failures gracefully
Closed, ResolvedPublic

Description

One of the Kafka brokers (kafka1012) was rebooted for a kernel upgrade and didn't successfully come back (for whatever reason).

Unfortunately, what followed was a full-blown API appserver outage for approximately 25 minutes, until we identified the cause (piled up SYN_SENT connections, blocking HHVM threads and ultimately DoSing it).

We depooled kafka1012 from mediawiki-config with 7273a5997d07c02562d5a91a1e958cd089dcd663 (and I'd be inclined to set all of wmgKafkaServers to false very soon), but this is a workaround. MediaWiki needs to be fixed to handle these (really, expected!) failures gracefully in the future.

Note that we have had a similar outage before, as @bd808 may remember (the exact same symptoms caused by Monolog attempting to connect to a failed Redis, IIRC). Let's all make sure this does not happen a third time, by testing those failure scenarios early in advance and before deploying this in production. See T88732 and how logging has been decoupled by using syslog over UDP.

Event Timeline

faidon created this task.Jan 28 2016, 2:26 PM
faidon updated the task description. (Show Details)
faidon raised the priority of this task from to Unbreak Now!.
faidon added subscribers: faidon, EBernhardson, bd808, elukey.
Restricted Application added a subscriber: Aklapper.Jan 28 2016, 2:26 PM
Joe added a subscriber: Joe.Jan 28 2016, 2:53 PM

I just found out that there seems to be a bug in HHVM's fsockopen implementation: when you try to connect to a dead host, it does not respect the timeout you pass in the invocation.

As a test I ran:

php -r 'var_dump(fsockopen("kafka1012.eqiad.wmnet", 9092, $errno, $errstr, 0.1));'

Which is supposed to return an error within 0.1 seconds; it does so on Zend PHP, but not on HHVM, where the request hangs for several seconds.
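For illustration, here is the same experiment in Python rather than PHP (a sketch, not part of the original report): a sub-second connect timeout to an unreachable host should fail quickly, instead of blocking for the kernel's full SYN-retransmit interval — which is the behavior Zend PHP shows and the HHVM fix restores. The address 203.0.113.1 is a TEST-NET address that is never routable.

```python
import socket
import time

# A short connect timeout to an unreachable host should fail fast,
# not hang in SYN_SENT for the kernel's retransmit interval.
start = time.monotonic()
try:
    socket.create_connection(("203.0.113.1", 9092), timeout=0.1)
except OSError as e:
    elapsed = time.monotonic() - start
    print(f"failed after {elapsed:.2f}s: {e!r}")
```

A runtime that ignores the timeout here is what turns one dead broker into a pile of blocked worker threads.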

Joe added a comment.Jan 28 2016, 3:02 PM

My (very early) testing seems to show that HHVM 3.11 behaves correctly in this case.

Joe added a project: HHVM.Jan 28 2016, 3:19 PM
Joe set Security to None.

So the workaround while we deploy a new HHVM package (which will take some time) is to depool kafka servers whenever we need to reboot them, or one goes down for good.

This is far from ideal, of course.

Hmm, I think we could put kafka clusters into pybal, and use LVS for bootstrapping/metadata lookups. Then clients could use something like kafka.analytics-eqiad.svc.eqiad.wmnet for bootstrapping. This might make pooling/depooling and automatically failing over bootstrap meta data requests easier.

We'd have to confirm that this would work with all Kafka clients, but I'm pretty sure it would.

hashar added a subscriber: hashar.Jan 28 2016, 4:46 PM

> Hmm, I think we could put kafka clusters into pybal, and use LVS for bootstrapping/metadata lookups. Then clients could use something like kafka.analytics-eqiad.svc.eqiad.wmnet for bootstrapping. This might make pooling/depooling and automatically failing over bootstrap meta data requests easier.
>
> We'd have to confirm that this would work with all Kafka clients, but I'm pretty sure it would.

That might have helped here but it's not really a solution to _this_ problem. Even if e.g. *all* brokers are down, the outage should not cascade to the (API) appserver fleet.
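The principle stated above — a logging transport failure must never cascade into the appserver — can be sketched as a log handler that bounds its worst case with a short timeout and drops records on error. This is a hypothetical Python illustration of the design goal, not the actual MediaWiki/Monolog KafkaHandler code; the class name and timeout value are invented for the example.

```python
import logging
import socket

class BestEffortSocketHandler(logging.Handler):
    """Hypothetical sketch: a handler that never lets log-transport
    failures propagate to the request thread. A short connect timeout
    bounds the worst case; all socket errors are swallowed."""

    def __init__(self, host, port, timeout=0.1):
        super().__init__()
        self.addr = (host, port)
        self.timeout = timeout

    def emit(self, record):
        try:
            with socket.create_connection(self.addr, timeout=self.timeout) as s:
                s.sendall(self.format(record).encode() + b"\n")
        except OSError:
            pass  # drop the log line rather than block or raise

# Even with every "broker" down (unreachable TEST-NET address),
# logging returns quickly and the application keeps serving.
log = logging.getLogger("demo")
log.addHandler(BestEffortSocketHandler("203.0.113.1", 9092))
log.error("broker down")
```

Dropping logs is an explicit trade-off: losing a log line is vastly cheaper than losing the API appserver fleet.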

Reminder to delete workaround documentation here and here when this ticket is fixed.

hashar updated the task description. (Show Details)Jan 28 2016, 5:03 PM

Edited the task description to point to T88732: Decouple logging infrastructure failures from MediaWiki logging, which is the log/Redis issue faidon was referring to. It got solved by using syslog/UDP for transport.
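As a sketch (in Python, with a made-up local port) of why the syslog-over-UDP transport from T88732 sidesteps this whole failure class: a UDP send completes locally, with no handshake, so a dead or slow collector cannot block the sending thread the way a TCP connect to a dead Kafka broker or Redis host can.

```python
import socket
import time

# Fire-and-forget: sendto() on a UDP socket returns as soon as the
# datagram is handed to the local stack. No listener is required;
# port 10514 is an arbitrary example.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
start = time.monotonic()
sock.sendto(b"<134>mediawiki: test message", ("127.0.0.1", 10514))
elapsed = time.monotonic() - start
print(f"sent in {elapsed * 1000:.2f}ms")
sock.close()
```

The cost is that delivery is best-effort, which is exactly the decoupling the task asks for.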

Change 267196 had a related patch set uploaded (by MaxSem):
KafkaHandler: allow customizing timeouts

https://gerrit.wikimedia.org/r/267196

Change 267200 had a related patch set uploaded (by MaxSem):
Reduce Kafka timeouts

https://gerrit.wikimedia.org/r/267200

Joe added a comment.Jan 29 2016, 8:24 AM

I think we should just backport this patch to our current package until we are confident releasing a new one. This is absolutely devastating, as a single server going down (not just Kafka, but anything we connect to using fsockopen) can cause a site-wide outage.

Thanks @MaxSem for taking the time to find the upstream fix.

Change 267228 had a related patch set uploaded (by Giuseppe Lavagetto):
Add support for float timeouts in socket streams

https://gerrit.wikimedia.org/r/267228

Joe added a comment.Jan 29 2016, 12:02 PM

Patch applied and new package built.

The package was installed on labs and my test of fsockopen now shows that the timeout is respected.

I am going to install the newly built package on the canaries now.

Change 267228 merged by Giuseppe Lavagetto:
Add support for float timeouts in socket streams

https://gerrit.wikimedia.org/r/267228

Change 267196 merged by jenkins-bot:
KafkaHandler: allow customizing timeouts

https://gerrit.wikimedia.org/r/267196

Joe added a comment.Feb 1 2016, 6:37 PM

All appservers have been upgraded; I'll perform some tests tomorrow to ensure this is solved.

Joe claimed this task.Feb 1 2016, 6:38 PM
Joe closed this task as Resolved.Feb 2 2016, 5:55 PM

Pointing MediaWiki on a production appserver to a nonexistent IP for Kafka no longer leaves stale connections or busy threads behind.

Change 267200 merged by jenkins-bot:
Reduce Kafka timeouts

https://gerrit.wikimedia.org/r/267200

Change 273488 had a related patch set uploaded (by Elukey):
Add kafka1012.eqiad.wmnet back to the media-wiki config.

https://gerrit.wikimedia.org/r/273488

Change 273488 merged by jenkins-bot:
Add kafka1012.eqiad.wmnet back to the media-wiki config.

https://gerrit.wikimedia.org/r/273488