MediaWiki monolog doesn't handle Kafka failures gracefully
Closed, ResolvedPublic

Description

One of the Kafka brokers (kafka1012) was rebooted for a kernel upgrade and didn't successfully come back (for whatever reason).

Unfortunately, what followed was a full-blown API appserver outage for approximately 25 minutes, until we identified the cause (piled up SYN_SENT connections, blocking HHVM threads and ultimately DoSing it).

We depooled kafka1012 from mediawiki-config with 7273a5997d07c02562d5a91a1e958cd089dcd663 (and I'd be inclined to set all of wmgKafkaServers to false very soon), but this is a workaround. MediaWiki needs to be fixed to handle these (really, expected!) failures gracefully in the future.

Note that we have had a similar outage before, as @bd808 may remember (the exact same symptoms caused by Monolog attempting to connect to a failed Redis, IIRC). Let's all make sure this does not happen a third time, by testing those failure scenarios early in advance and before deploying this in production. See T88732 and how logging has been decoupled by using syslog over UDP.

Event Timeline

faidon created this task.Jan 28 2016, 2:26 PM
faidon updated the task description. (Show Details)
faidon raised the priority of this task from to Unbreak Now!.
faidon added subscribers: faidon, EBernhardson, bd808, elukey.
Restricted Application added a subscriber: Aklapper.Jan 28 2016, 2:26 PM
Joe added a subscriber: Joe.Jan 28 2016, 2:53 PM

I just found out that there seems to be a bug in HHVM's fsockopen implementation: when you try to connect to a dead host, it does not respect the timeout you pass in the invocation.

As a test I ran:

php -r 'var_dump(fsockopen("kafka1012.eqiad.wmnet", 9092, $errno, $errstr, 0.1));'

Which is supposed to return an error within 0.1 seconds; it does so on Zend PHP, but not on HHVM, where the request hangs for several seconds.
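For illustration, here is the same experiment in Python rather than PHP (a sketch, not part of the original report): a sub-second connect timeout to an unreachable host should fail quickly, instead of blocking for the kernel's full SYN-retransmit interval — which is the behavior Zend PHP shows and the HHVM fix restores. The address 203.0.113.1 is a TEST-NET address that is never routable.

```python
import socket
import time

# A short connect timeout to an unreachable host should fail fast,
# not hang in SYN_SENT for the kernel's retransmit interval.
start = time.monotonic()
try:
    socket.create_connection(("203.0.113.1", 9092), timeout=0.1)
except OSError as e:
    elapsed = time.monotonic() - start
    print(f"failed after {elapsed:.2f}s: {e!r}")
```

A runtime that ignores the timeout here is what turns one dead broker into a pile of blocked worker threads.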

Joe added a comment.Jan 28 2016, 3:02 PM

My (very early) testing seems to show that HHVM 3.11 behaves correctly in this case.

Joe added a project: HHVM.Jan 28 2016, 3:19 PM
Joe set Security to None.

So the workaround while we deploy a new HHVM package (which will take some time) is to depool kafka servers whenever we need to reboot them, or one goes down for good.

This is far from ideal, of course.

Hmm, I think we could put kafka clusters into pybal, and use LVS for bootstrapping/metadata lookups. Then clients could use something like kafka.analytics-eqiad.svc.eqiad.wmnet for bootstrapping. This might make pooling/depooling and automatically failing over bootstrap meta data requests easier.

We'd have to confirm that this would work with all Kafka clients, but I'm pretty sure it would.

hashar added a subscriber: hashar.Jan 28 2016, 4:46 PM

> Hmm, I think we could put kafka clusters into pybal, and use LVS for bootstrapping/metadata lookups. Then clients could use something like kafka.analytics-eqiad.svc.eqiad.wmnet for bootstrapping. This might make pooling/depooling and automatically failing over bootstrap meta data requests easier.
>
> We'd have to confirm that this would work with all Kafka clients, but I'm pretty sure it would.

That might have helped here but it's not really a solution to _this_ problem. Even if e.g. *all* brokers are down, the outage should not cascade to the (API) appserver fleet.
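The principle stated above — a logging transport failure must never cascade into the appserver — can be sketched as a log handler that bounds its worst case with a short timeout and drops records on error. This is a hypothetical Python illustration of the design goal, not the actual MediaWiki/Monolog KafkaHandler code; the class name and timeout value are invented for the example.

```python
import logging
import socket

class BestEffortSocketHandler(logging.Handler):
    """Hypothetical sketch: a handler that never lets log-transport
    failures propagate to the request thread. A short connect timeout
    bounds the worst case; all socket errors are swallowed."""

    def __init__(self, host, port, timeout=0.1):
        super().__init__()
        self.addr = (host, port)
        self.timeout = timeout

    def emit(self, record):
        try:
            with socket.create_connection(self.addr, timeout=self.timeout) as s:
                s.sendall(self.format(record).encode() + b"\n")
        except OSError:
            pass  # drop the log line rather than block or raise

# Even with every "broker" down (unreachable TEST-NET address),
# logging returns quickly and the application keeps serving.
log = logging.getLogger("demo")
log.addHandler(BestEffortSocketHandler("203.0.113.1", 9092))
log.error("broker down")
```

Dropping logs is an explicit trade-off: losing a log line is vastly cheaper than losing the API appserver fleet.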

Reminder to delete workaround documentation here and here when this ticket is fixed.

hashar updated the task description. (Show Details)Jan 28 2016, 5:03 PM

Edited the task description to point to T88732: Decouple logging infrastructure failures from MediaWiki logging, which is the log/Redis issue faidon was referring to. It got solved by using syslog/UDP for transport.
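As a sketch (in Python, with a made-up local port) of why the syslog-over-UDP transport from T88732 sidesteps this whole failure class: a UDP send completes locally, with no handshake, so a dead or slow collector cannot block the sending thread the way a TCP connect to a dead Kafka broker or Redis host can.

```python
import socket
import time

# Fire-and-forget: sendto() on a UDP socket returns as soon as the
# datagram is handed to the local stack. No listener is required;
# port 10514 is an arbitrary example.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
start = time.monotonic()
sock.sendto(b"<134>mediawiki: test message", ("127.0.0.1", 10514))
elapsed = time.monotonic() - start
print(f"sent in {elapsed * 1000:.2f}ms")
sock.close()
```

The cost is that delivery is best-effort, which is exactly the decoupling the task asks for.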

Change 267196 had a related patch set uploaded (by MaxSem):
KafkaHandler: allow customizing timeouts

https://gerrit.wikimedia.org/r/267196

Change 267200 had a related patch set uploaded (by MaxSem):
Reduce Kafka timeouts

https://gerrit.wikimedia.org/r/267200

Joe added a comment.Jan 29 2016, 8:24 AM

I think we should just backport this patch to our current package until we are confident releasing a new one. This is absolutely devastating, as a single server going down (not just Kafka, but anything we connect to using fsockopen) can cause a site-wide outage.

Thanks @MaxSem for taking the time to find the upstream fix.

Change 267228 had a related patch set uploaded (by Giuseppe Lavagetto):
Add support for float timeouts in socket streams

https://gerrit.wikimedia.org/r/267228

Joe added a comment.Jan 29 2016, 12:02 PM

Patch applied and new package built.

The package was installed on labs and my test of fsockopen now shows that the timeout is respected.

I am going to install the newly built package on the canaries now.

Change 267228 merged by Giuseppe Lavagetto:
Add support for float timeouts in socket streams

https://gerrit.wikimedia.org/r/267228

Change 267196 merged by jenkins-bot:
KafkaHandler: allow customizing timeouts

https://gerrit.wikimedia.org/r/267196

Joe added a comment.Feb 1 2016, 6:37 PM

All appservers have been upgraded; I'll perform some tests tomorrow to ensure this is solved.

Joe claimed this task.Feb 1 2016, 6:38 PM
Joe closed this task as Resolved.Feb 2 2016, 5:55 PM

Pointing MediaWiki on a production appserver to a nonexistent IP for Kafka no longer leaves stale connections or busy threads behind.

Change 267200 merged by jenkins-bot:
Reduce Kafka timeouts

https://gerrit.wikimedia.org/r/267200

Change 273488 had a related patch set uploaded (by Elukey):
Add kafka1012.eqiad.wmnet back to the media-wiki config.

https://gerrit.wikimedia.org/r/273488

Change 273488 merged by jenkins-bot:
Add kafka1012.eqiad.wmnet back to the media-wiki config.

https://gerrit.wikimedia.org/r/273488