
Eventstreams in codfw down for several hours due to kafka2001 -> kafka-main2001 swap
Closed, Resolved · Public · 0 Estimated Story Points

Description

Hello everybody,

I noticed some SERVICE_UNKNOWN icinga alerts for all the eventstreams daemons on scb2*, and when I jumped on the hosts I found the following.

The issue seems to have started on 2019-06-27 at around 15:30 UTC. The logs around that time report:

{"name":"eventstreams","hostname":"scb2001","pid":28944,"level":40,"message":"Heap memory limit temporarily exceeded","limit":786432000,"memoryUsage":{"rss":905519104,"heapTotal":821465088,"h
eapUsed":787728896,"external":375775},"levelPath":"warn/service-runner/heap","msg":"Heap memory limit temporarily exceeded","time":"2019-06-27T15:20:49.081Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":30,"msg":"Creating new KafkaSSE instance 3eabf172-98ef-11e9-af9e-f32d23fced49.","tim
e":"2019-06-27T15:21:32.757Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":28944,"level":40,"message":"Heap memory limit temporarily exceeded","limit":786432000,"memoryUsage":{"rss":906203136,"heapTotal":821465088,"h
eapUsed":788269760,"external":399475},"levelPath":"warn/service-runner/heap","msg":"Heap memory limit temporarily exceeded","time":"2019-06-27T15:21:49.083Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":50,"err":{"message":"broker transport failure","name":"Error","stack":"Error: Local:
 Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka
/lib/error.js:260:10)\n    at /srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/client.js:339:28","code":-195},"msg":"broke
r transport failure","time":"2019-06-27T15:22:02.862Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":30,"msg":"Disconnecting.","time":"2019-06-27T15:22:02.862Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":4628,"id":"3eabf172-98ef-11e9-af9e-f32d23fced49","level":30,"msg":"Closed HTTP request.","time":"2019-06-27T15:22:02.863Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":23206,"id":"5d5ef44a-98ef-11e9-8a7a-3b80333ebd17","level":30,"msg":"Creating new KafkaSSE instance 5d5ef44a-98ef-11e9-8a7a-3b80333ebd17.","ti
me":"2019-06-27T15:22:24.262Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":28944,"level":60,"message":"Heap memory limit exceeded","limit":786432000,"memoryUsage":{"rss":905822208,"heapTotal":823562240,"heapUsed":787
789352,"external":388893},"levelPath":"fatal/service-runner/heap","msg":"Heap memory limit exceeded","time":"2019-06-27T15:22:49.083Z","v":0}
{"name":"eventstreams","hostname":"scb2001","pid":23206,"id":"5d5ef44a-98ef-11e9-8a7a-3b80333ebd17","level":50,"err":{"message":"broker transport failure","name":"Error","stack":"Error: Local
: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafk
a/lib/error.js:260:10)\n    at /srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/client.js:339:28","code":-195},"msg":"brok
er transport failure","time":"2019-06-27T15:22:54.366Z","v":0}

And then a lot of:

{"name":"eventstreams","hostname":"scb2001","pid":27327,"id":"b8dd28d2-98f2-11e9-b7b8-5fe9ebb59a43","level":50,"err":{"message":"broker transport failure","name":"Error","stack":"Error: Local: Broker transport failure\n    at Function.createLibrdkafkaError [as create] (/srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/error.js:260:10)\n    at /srv/deployment/eventstreams/deploy-cache/revs/07033d4d18f06cbe06dc69d8ebb9725bba16185f/node_modules/node-rdkafka/lib/client.js:339:28","code":-195},"msg":"broker transport failure","time":"2019-06-27T15:46:56.378Z","v":0}

After a rolling restart everything went back to working as before. If I am not mistaken the service runs active/active, so it is not great that all of codfw was down for several hours :(

The only relevant action that I noticed on the SAL around that time was:

https://tools.wmflabs.org/sal/log/AWuZZBG4OwpQ-3PktaQH (kafka2001 -> kafka-main2001 swap)

Event Timeline

elukey triaged this task as Medium priority.Jun 28 2019, 9:06 AM
elukey created this task.

@Samwalton9 We are currently working on it; there is indeed an issue with the eqiad backend hosts of eventstreams :(

We got the following alarm at 10:42 UTC (why only analytics? Shouldn't this be owned by SRE?)

10:42  <icinga-wm> PROBLEM - Check if active EventStreams endpoint is delivering messages. on icinga1001 is CRITICAL: CRITICAL:
                   No EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventStreams
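The check behind this alert essentially asserts that at least one message arrives on the public recentchange stream within 10 seconds. A rough stand-in in Node, using the eventsource npm package; this is only an illustration, not the actual icinga check:

// Fail (nagios CRITICAL, exit 2) if no SSE message arrives within 10 seconds.
const EventSource = require('eventsource');

const url = 'https://stream.wikimedia.org/v2/stream/recentchange';
const es = new EventSource(url);

const timer = setTimeout(() => {
  console.error(`CRITICAL: no message from ${url} within 10 seconds`);
  es.close();
  process.exit(2);
}, 10000);

es.onmessage = () => {
  clearTimeout(timer);
  console.log('OK: message received');
  es.close();
  process.exit(0);
};

es.onerror = (err) => {
  console.error('Stream error', err);
};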

https://stream.wikimedia.org/v2/stream/recentchange was returning a 502 from nginx on the esams and eqiad Varnish caches. I checked the backends on various scb nodes (eqiad and codfw), and the local endpoint was working fine. Then I checked the dashboard:

Screen Shot 2019-06-28 at 1.44.35 PM.png (dashboard screenshot)

As you can see, after my rolling restart of eventstreams in codfw there was a slow creep-up of connections on the scb1* nodes. Eventually I was able to solve this with a rolling restart of eventstreams on the scb1* nodes (you can see it on the graph).

No idea what happened here, but I suspect that our throttling (https://phabricator.wikimedia.org/T196553) played a role.

EventStreams is hitting its concurrent connection limit of about 200 connections. We think this is probably due to a single client opening many connections, but we aren't yet 100% sure about that. We are looking into it!

Mentioned in SAL (#wikimedia-operations) [2019-06-28T17:36:31Z] <ottomata> restarting eventstreams on scb1001 with trace logging of X-Client-IP for T226808

Collected some info about which IPs were connecting to scb1001 over a period of about 40 minutes:

 3 "100.26.159.155"
54 "104.248.45.188"
 1 "172.16.1.236"

Change 519713 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Configure per Client-IP connection limits

https://gerrit.wikimedia.org/r/519713

Change 519717 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Temporary hack - Block IP from making connections

https://gerrit.wikimedia.org/r/519717

Change 519717 merged by Ottomata:
[mediawiki/services/eventstreams@master] Temporary hack - Block IP from making connections

https://gerrit.wikimedia.org/r/519717

Mentioned in SAL (#wikimedia-operations) [2019-06-28T21:16:47Z] <otto@deploy1001> Started deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808

Mentioned in SAL (#wikimedia-operations) [2019-06-28T21:19:54Z] <otto@deploy1001> Finished deploy [eventstreams/deploy@2af2719]: Manually blacklisting IP - T226808 (duration: 03m 07s)

To hold us over for the weekend, I've manually blacklisted the offending IP in the EventStreams code and deployed. We'll work on a better solution next week.
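The temporary hack amounts to a hard-coded check against the request's X-Client-IP before a KafkaSSE instance is created. Roughly along these lines; the IP, status code and handler name are illustrative, not the actual eventstreams code:

// Sketch of a hard-coded IP block before accepting an SSE connection.
// Blacklist contents and the surrounding handler are illustrative only.
const BLACKLISTED_IPS = new Set(['192.0.2.1']); // placeholder, not the real IP

function handleStreamRequest(req, res) {
  const clientIp = req.headers['x-client-ip'] || req.socket.remoteAddress;
  if (BLACKLISTED_IPS.has(clientIp)) {
    res.writeHead(429, { 'Content-Type': 'text/plain' });
    res.end('Too many concurrent connections from your IP\n');
    return;
  }
  // ...otherwise create the KafkaSSE instance and start streaming...
}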

Screen Shot 2019-06-28 at 4.00.56 PM.png (connections graph screenshot)

Well, blocking that one IP had the effect of lowering connections. Given that we have a limit of about 200 connections, we were thinking of limiting the number of connections established per IP (even if they are long lived) to 1 or 2.
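Per-IP limiting boils down to keeping a live count of open connections keyed by X-Client-IP in each worker and refusing new ones above a threshold. A minimal sketch of the idea; the limit value and function names are assumptions, the real change is the eventstreams patch linked below:

// Minimal per-X-Client-IP connection counter for one worker process.
const MAX_CONNECTIONS_PER_IP = 2; // assumed value
const connectionsPerIp = new Map();

function acquireConnection(req, res) {
  const ip = req.headers['x-client-ip'] || req.socket.remoteAddress;
  const current = connectionsPerIp.get(ip) || 0;
  if (current >= MAX_CONNECTIONS_PER_IP) {
    res.writeHead(429, { 'Content-Type': 'text/plain' });
    res.end('Too many concurrent connections\n');
    return false;
  }
  connectionsPerIp.set(ip, current + 1);
  // Release the slot when the long-lived SSE connection finally closes.
  res.on('close', () => {
    const n = connectionsPerIp.get(ip) || 1;
    if (n <= 1) connectionsPerIp.delete(ip);
    else connectionsPerIp.set(ip, n - 1);
  });
  return true;
}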

Change 519713 merged by Ottomata:
[mediawiki/services/eventstreams@master] Support per X-Client-IP connection limits

https://gerrit.wikimedia.org/r/519713

Change 520059 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams/deploy@master] Support X-Client-IP connection limiting

https://gerrit.wikimedia.org/r/520059

Change 520059 merged by Ottomata:
[mediawiki/services/eventstreams/deploy@master] Support X-Client-IP connection limiting

https://gerrit.wikimedia.org/r/520059

Change 520068 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Limit EventStreams per X-Client-IP connections

https://gerrit.wikimedia.org/r/520068

Change 520074 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[mediawiki/services/eventstreams@master] Only increment clientConnectionMetric if it is set

https://gerrit.wikimedia.org/r/520074

Change 520074 merged by Ottomata:
[mediawiki/services/eventstreams@master] Only increment clientConnectionMetric if it is set

https://gerrit.wikimedia.org/r/520074

Change 520068 merged by Ottomata:
[operations/puppet@production] Limit EventStreams per X-Client-IP connections

https://gerrit.wikimedia.org/r/520068

Ok, all patches ready to go. Deployed in beta and looks good there. It is near the end of my day now, so I'll wait until tomorrow to deploy to production eventstreams.

Mentioned in SAL (#wikimedia-operations) [2019-07-02T14:07:10Z] <otto@deploy1001> Started deploy [eventstreams/deploy@de1d356]: Limit concurrent number of connections per X-Client-IP - T226808

Mentioned in SAL (#wikimedia-operations) [2019-07-02T14:13:27Z] <otto@deploy1001> Finished deploy [eventstreams/deploy@de1d356]: Limit concurrent number of connections per X-Client-IP - T226808 (duration: 06m 17s)

Hm, I just noticed there are more eventstreams processors than I had thought. There are 6 scb nodes in codfw and 4 nodes in eqiad, for a total of 208 processors between them. EventStreams is configured to spawn one worker per processor. As is, this won't help keep the varnish connection pool from filling up.

We could increase the varnish connection pool. Might not be a bad idea. Or we could decrease the number of eventstreams workers. Also might not be a bad idea.

Change 520247 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::eventstreams: add X-Client-IP to the health check

https://gerrit.wikimedia.org/r/520247

Change 520247 merged by Ottomata:
[operations/puppet@production] profile::eventstreams: add X-Client-IP to the health check

https://gerrit.wikimedia.org/r/520247

Ah, still wrong. Full details:

The varnish limit is 25 per varnish-backend instance in core DCs (eqiad and codfw). Only the varnish-backend in core DCs connects to the application. Since this service is active-active, and there are 10 text varnishes in codfw and 8 in eqiad, the varnish concurrent connection limit for eventstreams is 25*10 + 25*8 == 450.

During this outage, the offending connections were mostly from esams, which is routed to eqiad, which is why we saw connections failing there at about 200. In that case, the 4 nodes in eqiad have a total of 112 processors. So, assuming a single X-Client-IP connects via e.g. esams or eqiad more than 112 times and the connections are evenly spread across all workers, it will finally be rejected. Assuming they are evenly spread is a big assumption. I'm still not sure this fix will catch the baddy.
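A back-of-the-envelope restatement of that concern, assuming a per-worker limit of 1 as implied above (the actual configured limit may differ):

// If each of the 112 eqiad workers counts connections independently,
// a per-worker limit only rejects a client once it is exceeded on every worker.
const workersInEqiad = 112;
const perWorkerLimit = 1; // assumed, matching the "> 112 times" estimate above
const worstCaseBeforeRejection = workersInEqiad * perWorkerLimit;
console.log(worstCaseBeforeRejection); // 112 connections, and only with a perfectly even spread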

Per our conversation at standup, we should probably have limits per host; does the IP limit need to be at the master process?
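One way to make the limit effective per host rather than per worker would be to keep the per-IP counts in the master process and have workers ask it before accepting a connection. A rough sketch using Node's cluster IPC; this is an illustration of the idea raised above, not what was implemented, and all names and message shapes are made up:

// Sketch: centralize per-IP counts in the master via cluster IPC.
const cluster = require('cluster');

if (cluster.isMaster) {
  const counts = new Map();
  const LIMIT_PER_HOST = 2; // assumed value

  cluster.on('message', (worker, msg) => {
    if (msg.type === 'acquire') {
      const n = counts.get(msg.ip) || 0;
      const ok = n < LIMIT_PER_HOST;
      if (ok) counts.set(msg.ip, n + 1);
      worker.send({ type: 'acquire-reply', ip: msg.ip, ok });
    } else if (msg.type === 'release') {
      counts.set(msg.ip, Math.max(0, (counts.get(msg.ip) || 1) - 1));
    }
  });

  cluster.on('exit', () => process.exit(0));
  cluster.fork();
} else {
  // A worker sends { type: 'acquire', ip } before starting a stream
  // and { type: 'release', ip } when the connection closes.
  process.send({ type: 'acquire', ip: '198.51.100.7' }); // example IP
  process.on('message', (msg) => {
    if (msg.type === 'acquire-reply') {
      console.log(msg.ok ? 'accept connection' : 'reject connection');
      process.send({ type: 'release', ip: msg.ip });
      process.exit(0);
    }
  });
}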

Krinkle claimed this task.
Krinkle removed Krinkle as the assignee of this task.
Krinkle updated the task description. (Show Details)
Krinkle subscribed.