
Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap
Closed, Resolved · Public · 0 Estimated Story Points

Description

Hello everybody,

I noticed some SERVICE_UNKNOWN icinga alerts for all the eventstreams daemons on scb2*, and when I jumped on the hosts I found the following.

The issue seems to have started on 2019-06-27 at around 15:30 UTC. The logs around that time report:

{"name":"eventstreams","hostname":"scb2001","pid":28944,"level":40,"message":"Heap memory limit temporarily exceeded","limit":786432000,"memoryUsage":{"rss":905519104,"heapTotal":821465088,"h
eapUsed":787728896,"external":375775},"levelPath":"warn/service-runner/heap","msg":"Heap memory limit temporarily exceeded","time":"2019-06-27T15:20:49.081Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":30,"msg":"Creating new KafkaSSE instance 3eabf172-98ef-11e9-af9e-f32d23fced49.","tim
e":"2019-06-27T15:21:32.757Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":28944,"level":40,"message":"Heap memory limit temporarily exceeded","limit":786432000,"memoryUsage":{"rss":906203136,"heapTotal":821465088,"h
eapUsed":788269760,"external":399475},"levelPath":"warn/service-runner/heap","msg":"Heap memory limit temporarily exceeded","time":"2019-06-27T15:21:49.083Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":50,"err":{"message":"broker transport failure","name":"Error","stack":"Error: Local:
 Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka
/lib/error.js:260:10)\n    at /srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/client.js:339:28","code":-195},"msg":"broke
r transport failure","time":"2019-06-27T15:22:02.862Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":30,"msg":"Disconnecting.","time":"2019-06-27T15:22:02.862Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":30,"msg":"Closed HTTP request.","time":"2019-06-27T15:22:02.863Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":23206,"id":"5d5ef44a-98ef-11e9-8a7a-3b80333ebd17","level":30,"msg":"Creating new KafkaSSE instance 5d5ef44a-98ef-11e9-8a7a-3b80333ebd17.","ti
me":"2019-06-27T15:22:24.262Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":28944,"level":60,"message":"Heap memory limit exceeded","limit":786432000,"memoryUsage":{"rss":905822208,"heapTotal":823562240,"heapUsed":787
789352,"external":388893},"levelPath":"fatal/service-runner/heap","msg":"Heap memory limit exceeded","time":"2019-06-27T15:22:49.083Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":23206,"id":"5d5ef44a-98ef-11e9-8a7a-3b80333ebd17","level":50,"err":{"message":"broker transport failure","name":"Error","stack":"Error: Local
: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafk
a/lib/error.js:260:10)\n    at /srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/client.js:339:28","code":-195},"msg":"brok
er transport failure","time":"2019-06-27T15:22:54.366Z","v":0}

And then a lot of:

{"name":"eventstreams","hostname":"scb2001","pid":27327,"id":"b8dd28d2-98f2-11e9-b7b8-5fe9ebb59a43","level":50,"err":{"message":"broker transport failure","name":"Error","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/error.js:260:10)\n    at /srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/client.js:339:28","code":-195},"msg":"broker transport failure","time":"2019-06-27T15:46:56.378Z","v":0}

After a rolling restart everything went back to working as before. If I am not mistaken the service runs active/active, so it is not great that all of codfw was down for several hours :(

The only relevant action that I noticed on the SAL around that time was:

https://tools.wmflabs.org/sal/log/AWuZZBG4OwpQ-3PktaQH (kafka2001 -> kafka-main2001 swap)

Event Timeline

elukey triaged this task as Medium priority.Jun 28 2019, 9:06 AM
elukey created this task.

@Samwalton9 We are currently working on it; there is indeed an issue with the eqiad backend hosts of eventstreams :(

We got the following alarm at 10:42 UTC (why only analytics? Shouldn't this be owned by SRE?)

10:42  <icinga-wm> PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL:
                   No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams
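The check behind this alert essentially asserts that at least one message arrives on the public recentchange stream within 10 seconds. A rough stand-in in Node, using the eventsource npm package; this is only an illustration, not the actual icinga check:

// Fail (nagios CRITICAL, exit 2) if no SSE message arrives within 10 seconds.
const EventSource = require('eventsource');

const url = 'https://stream.wikimedia.org/v2/stream/recentchange';
const es = new EventSource(url);

const timer = setTimeout(() => {
  console.error(`CRITICAL: no message from ${url} within 10 seconds`);
  es.close();
  process.exit(2);
}, 10000);

es.onmessage = () => {
  clearTimeout(timer);
  console.log('OK: message received');
  es.close();
  process.exit(0);
};

es.onerror = (err) => {
  console.error('Stream error', err);
};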

https://stream.wikimedia.org/v2/stream/recentchange was returning a 502 from nginx on the esams and eqiad Varnish caches. I checked the backends on various scb nodes (eqiad and codfw), and the local endpoint was working fine. Then I checked the dashboard:

Screen Shot 2019-06-28 at 1.44.35 PM.png (dashboard screenshot)

As you can see, after my rolling restart of eventstreams in codfw there was a slow creep-up of connections on the scb1* nodes. Eventually I was able to solve this with a rolling restart of eventstreams on the scb1* nodes (you can see it on the graph).

No idea what happened here, but I suspect that our throttling (https://phabricator.wikimedia.org/T196553) played a role.

EventStreams is hitting its concurrent connection limit of about 200 connections. We think this is probably due to a single client opening many connections, but we aren't yet 100% sure about that. We are looking into it!

Mentioned in SAL (#wikimedia-operations) [2019-06-28T17:36:31Z] <ottomata> restarting eventstreams on scb1001 with trace logging of X-Client-IP for T226808

Collected some info about which IPs were connecting to scb1001 over a period of about 40 minutes:

 3 "100.26.159.155"
54 "104.248.45.188"
 1 "172.16.1.236"

Change 519713 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Configure per Client-IP connection limits

https://gerrit.wikimedia.org/r/519713

Change 519717 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Temporary hack - Block IP from making connections

https://gerrit.wikimedia.org/r/519717

Change 519717 merged by Ottomata:
[mediawiki/services/eventstreams@master] Temporary hack - Block IP from making connections

https://gerrit.wikimedia.org/r/519717

Mentioned in SAL (#wikimedia-operations) [2019-06-28T21:16:47Z] <otto@deploy1001> Started deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808

Mentioned in SAL (#wikimedia-operations) [2019-06-28T21:19:54Z] <otto@deploy1001> Finished deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808 (duration: 03m 07s)

To hold us over for the weekend, I've manually blacklisted the offending IP in the EventStreams code and deployed. We'll work on a better solution next week.
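The temporary hack amounts to a hard-coded check against the request's X-Client-IP before a KafkaSSE instance is created. Roughly along these lines; the IP, status code and handler name are illustrative, not the actual eventstreams code:

// Sketch of a hard-coded IP block before accepting an SSE connection.
// Blacklist contents and the surrounding handler are illustrative only.
const BLACKLISTED_IPS = new Set(['192.0.2.1']); // placeholder, not the real IP

function handleStreamRequest(req, res) {
  const clientIp = req.headers['x-client-ip'] || req.socket.remoteAddress;
  if (BLACKLISTED_IPS.has(clientIp)) {
    res.writeHead(429, { 'Content-Type': 'text/plain' });
    res.end('Too many concurrent connections from your IP\n');
    return;
  }
  // ...otherwise create the KafkaSSE instance and start streaming...
}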

Screen Shot 2019-06-28 at 4.00.56 PM.png (connections graph screenshot)

Well, blocking that one IP had the effect of lowering connections. Given that we have a limit of about 200 connections, we were thinking of limiting the number of connections established per IP (even if they are long lived) to 1 or 2.
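Per-IP limiting boils down to keeping a live count of open connections keyed by X-Client-IP in each worker and refusing new ones above a threshold. A minimal sketch of the idea; the limit value and function names are assumptions, the real change is the eventstreams patch linked below:

// Minimal per-X-Client-IP connection counter for one worker process.
const MAX_CONNECTIONS_PER_IP = 2; // assumed value
const connectionsPerIp = new Map();

function acquireConnection(req, res) {
  const ip = req.headers['x-client-ip'] || req.socket.remoteAddress;
  const current = connectionsPerIp.get(ip) || 0;
  if (current >= MAX_CONNECTIONS_PER_IP) {
    res.writeHead(429, { 'Content-Type': 'text/plain' });
    res.end('Too many concurrent connections\n');
    return false;
  }
  connectionsPerIp.set(ip, current + 1);
  // Release the slot when the long-lived SSE connection finally closes.
  res.on('close', () => {
    const n = connectionsPerIp.get(ip) || 1;
    if (n <= 1) connectionsPerIp.delete(ip);
    else connectionsPerIp.set(ip, n - 1);
  });
  return true;
}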

Change 519713 merged by Ottomata:
[mediawiki/services/eventstreams@master] Support per X-Client-IP connection limits

https://gerrit.wikimedia.org/r/519713

Change 520059 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams/deploy@master] Support X-Client-IP connection limiting

https://gerrit.wikimedia.org/r/520059

Change 520059 merged by Ottomata:
[mediawiki/services/eventstreams/deploy@master] Support X-Client-IP connection limiting

https://gerrit.wikimedia.org/r/520059

Change 520068 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Limit EventStreams per X-Client-IP connections

https://gerrit.wikimedia.org/r/520068

Change 520074 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Only increment clientConnectionMetric if it is set

https://gerrit.wikimedia.org/r/520074

Change 520074 merged by Ottomata:
[mediawiki/services/eventstreams@master] Only increment clientConnectionMetric if it is set

https://gerrit.wikimedia.org/r/520074

Change 520068 merged by Ottomata:
[operations/puppet@production] Limit EventStreams per X-Client-IP connections

https://gerrit.wikimedia.org/r/520068

Ok, all patches ready to go. Deployed in beta and looks good there. It is near the end of my day now, so I'll wait until tomorrow to deploy to production eventstreams.

Mentioned in SAL (#wikimedia-operations) [2019-07-02T14:07:10Z] <otto@deploy1001> Started deploy [eventstreams/deploy@de1d356]: Limit concurrent number of connections per X-Client-IP - T226808

Mentioned in SAL (#wikimedia-operations) [2019-07-02T14:13:27Z] <otto@deploy1001> Finished deploy [eventstreams/deploy@de1d356]: Limit concurrent number of connections per X-Client-IP - T226808 (duration: 06m 17s)

Hm, I just noticed there are more eventstreams processors than I had thought. There are 6 scb nodes in codfw and 4 nodes in eqiad, for a total of 208 processors between them. EventStreams is configured to spawn one worker per processor. As is, this won't help keep the varnish connection pool from filling up.

We could increase the varnish connection pool. Might not be a bad idea. Or we could decrease the number of eventstreams workers. Also might not be a bad idea.

Change 520247 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::eventstreams: add X-Client-IP to the health check

https://gerrit.wikimedia.org/r/520247

Change 520247 merged by Ottomata:
[operations/puppet@production] profile::eventstreams: add X-Client-IP to the health check

https://gerrit.wikimedia.org/r/520247

Ah, still wrong. Full details:

The varnish limit is 25 per varnish-backend instance in core DCs (eqiad and codfw). Only the varnish-backend in core DCs connects to the application. Since this service is active-active, and there are 10 text varnishes in codfw and 8 in eqiad, the varnish concurrent connection limit for eventstreams is 25*10 + 25*8 == 450.

During this outage, the offending connections were mostly from esams, which is routed to eqiad, which is why we saw connections failing there at about 200. In that case, the 4 nodes in eqiad have a total of 112 processors. So, assuming a single X-Client-IP connects via e.g. esams or eqiad more than 112 times and the connections are evenly spread across all workers, it will finally be rejected. Assuming they are evenly spread is a big assumption. I'm still not sure this fix will catch the baddy.
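A back-of-the-envelope restatement of that concern, assuming a per-worker limit of 1 as implied above (the actual configured limit may differ):

// If each of the 112 eqiad workers counts connections independently,
// a per-worker limit only rejects a client once it is exceeded on every worker.
const workersInEqiad = 112;
const perWorkerLimit = 1; // assumed, matching the "> 112 times" estimate above
const worstCaseBeforeRejection = workersInEqiad * perWorkerLimit;
console.log(worstCaseBeforeRejection); // 112 connections, and only with a perfectly even spread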

Per our conversation at standup, we should probably have limits per host; does the IP limit need to be at the master process?
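One way to make the limit effective per host rather than per worker would be to keep the per-IP counts in the master process and have workers ask it before accepting a connection. A rough sketch using Node's cluster IPC; this is an illustration of the idea raised above, not what was implemented, and all names and message shapes are made up:

// Sketch: centralize per-IP counts in the master via cluster IPC.
const cluster = require('cluster');

if (cluster.isMaster) {
  const counts = new Map();
  const LIMIT_PER_HOST = 2; // assumed value

  cluster.on('message', (worker, msg) => {
    if (msg.type === 'acquire') {
      const n = counts.get(msg.ip) || 0;
      const ok = n < LIMIT_PER_HOST;
      if (ok) counts.set(msg.ip, n + 1);
      worker.send({ type: 'acquire-reply', ip: msg.ip, ok });
    } else if (msg.type === 'release') {
      counts.set(msg.ip, Math.max(0, (counts.get(msg.ip) || 1) - 1));
    }
  });

  cluster.on('exit', () => process.exit(0));
  cluster.fork();
} else {
  // A worker sends { type: 'acquire', ip } before starting a stream
  // and { type: 'release', ip } when the connection closes.
  process.send({ type: 'acquire', ip: '198.51.100.7' }); // example IP
  process.on('message', (msg) => {
    if (msg.type === 'acquire-reply') {
      console.log(msg.ok ? 'accept connection' : 'reject connection');
      process.send({ type: 'release', ip: msg.ip });
      process.exit(0);
    }
  });
}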

Krinkle claimed this task.
Krinkle removed Krinkle as the assignee of this task.
Krinkle updated the task description. (Show Details)
Krinkle subscribed.