One of the Kafka brokers (kafka1012) was rebooted for a kernel upgrade and didn't successfully come back (for whatever reason).
Unfortunately, what followed was a full-blown API appserver outage for approximately 25 minutes, until we identified the cause (piled up SYN_SENT connections, blocking HHVM threads and ultimately DoSing it).
We depooled kafka1012 from mediawiki-config with 7273a5997d07c02562d5a91a1e958cd089dcd663 (and I'd be inclined to set all of wmgKafkaServers to false very soon), but this is a workaround. MediaWiki needs to be fixed to handle these (really, expected!) failures gracefully in the future.
Note that we have had a similar outage before, as @bd808 may remember (the exact same symptoms caused by Monolog attempting to connect to a failed Redis, IIRC). Let's all make sure this does not happen a third time, by testing those failure scenarios early in advance and before deploying this in production. See T88732 and how logging has been decoupled by using syslog over UDP.