
Move EventStreams to main Kafka clusters
Closed, Resolved · Public · 5 Estimated Story Points

Description

EventStreams is backed by the analytics Kafka cluster, which has its topics mirrored from main Kafka. EventStreams could run on main Kafka, which would give it better multi DC support and make it no longer rely on MirrorMaker.

We need to move EventStreams' Kafka backend from analytics to the multi-DC main clusters. This means that offsets will change, which will disrupt connected SSE/EventSource clients. When they reconnect, they will provide invalid offsets to EventStreams. Fortunately, node-rdkafka will simply disregard these invalid offsets and start consuming at the end of the stream, so during the switchover clients may lose a few messages.
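
To illustrate that offset-reset behavior, here is a minimal node-rdkafka sketch (not the actual EventStreams code; the broker address, group id, and topic name are hypothetical). With `auto.offset.reset` set to `largest`, a consumer asked to resume from an offset that does not exist on the new cluster falls back to the end of the log instead of erroring out:

```
const Kafka = require('node-rdkafka');

const consumer = new Kafka.KafkaConsumer({
  'metadata.broker.list': 'kafka-main-example:9092', // hypothetical broker
  'group.id': 'eventstreams-example',
  'enable.auto.commit': false
}, {
  // Invalid offsets are discarded and consumption resets to the log end.
  'auto.offset.reset': 'largest'
});

consumer.connect();

consumer.on('ready', () => {
  // Offset 123456789 came from the old cluster and is invalid here,
  // so consumption resumes at the end of the stream.
  consumer.assign([{ topic: 'example.mediawiki.recentchange', partition: 0, offset: 123456789 }]);
  consumer.consume();
});

consumer.on('data', (message) => {
  console.log(message.value.toString());
});
```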

Since this will effectively connect internet-based users to the main Kafka clusters, we should implement some simple connection limits in EventStreams.

Anyway, for EventStreams:

Event Timeline

Ottomata triaged this task as Medium priority.Jan 18 2018, 4:15 PM
Ottomata created this task.

Change 405014 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use jumbo Kafka for EventStreams in deployment-prep

https://gerrit.wikimedia.org/r/405014

Change 405014 merged by Ottomata:
[operations/puppet@production] Use jumbo Kafka for EventStreams in deployment-prep

https://gerrit.wikimedia.org/r/405014

Now that the main Kafka clusters have been upgraded to 1.x, we use a 1.x MirrorMaker, which so far has been much more stable. I think we are ready to move forward with this. Time to write an email! :)

Here is a thought.

We are currently running EventStreams backed by the analytics-eqiad Kafka cluster, instead of the main-* clusters, so that we could avoid potential DoS attacks on the main-* Kafka cluster. However, running off of analytics-eqiad or jumbo-eqiad means that the data can only be served from eqiad, reducing the redundancy of the EventStreams service. It also means that EventStreams depends on MirrorMaker to replicate the data to jumbo-eqiad.

Should we consider running EventStreams on Kafka main-eqiad and main-codfw instead? If so, we'd likely want to implement some kind of throttling to keep internet users from causing too much load on Kafka. Could we limit the number of connections per IP, or even the bandwidth per connected client? Perhaps in varnish? I don't know how we usually do such things. :)

@BBlack @Pchelolo @mobrovac thoughts?

We actually have a rate limiter implementation integrated into service-runner - it's based on a DHT, so it's cluster-wide, and it's been tested pretty well for normal requests. However, I think we can't really use it as-is, since the EventStreams workload is quite different - the rate of subscription requests doesn't really tell us anything about the load being caused if the clients never disconnect.

But I think we could imagine a way to reuse the standard rate limiter in this scenario as well by adapting it somehow, or maybe just reuse the DHT backend to limit the number of concurrent connections. We first need to decide what we want to limit.

I think limiting concurrent connections would be a good enough start. Would that be easy enough? If so, I'd postpone the EventStreams switch to jumbo until it is implemented, and then we could switch it to main instead. A cluster switch is (sort of) disruptive to users, so the fewer we do the better.

I've looked into reusing our existing rate limiter for concurrent connection limiting instead of request rate limiting, but it would require significant changes to the library, since it was designed for a fairly different purpose. However, I believe adding a connection concurrency limiter based on KAD would be pretty easy. Alternatively, we could just use Redis for that?
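
A minimal sketch of what a Redis-backed concurrency counter could look like (this is not the service-runner limiter; the key name, limit, and use of the ioredis client are assumptions for illustration):

```
const Redis = require('ioredis');
const redis = new Redis(); // assumes a reachable Redis instance

const MAX_CONCURRENT = 1000;            // hypothetical service-wide cap
const KEY = 'eventstreams:connections'; // hypothetical counter key

// Called when a client opens an SSE connection.
async function acquireSlot() {
  const current = await redis.incr(KEY);
  if (current > MAX_CONCURRENT) {
    await redis.decr(KEY); // roll back and refuse the connection
    return false;
  }
  return true;
}

// Called when the client disconnects.
async function releaseSlot() {
  await redis.decr(KEY);
}

module.exports = { acquireSlot, releaseSlot };
```

One caveat with a plain counter like this: a crashed worker would leak slots, so a real implementation would need per-worker keys or a TTL-based heartbeat on top.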

Maybe worth checking Varnish's vsthrottle vmod? https://github.com/varnish/varnish-modules/blob/master/src/vmod_vsthrottle.vcc

I think it implements a leaky bucket for requests, whereas we actually want to limit persistent connections. It might be doable if we can define a bucket for eventstreams with the number of consumers we think we can sustain.

Not sure if this throttle is defined per Varnish host or per the Varnish "fleet".

I think our Varnish code limits are per IP. In this case that would not work, as I believe we would need "service limits" on the total number of consumers for the eventstreams endpoint: https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/text-frontend.inc.vcl.erb#L218

Ottomata renamed this task from "Move EventStreams to new jumbo cluster" to "Move EventStreams to main Kafka clusters". Edited Jun 6 2018, 1:46 PM
Ottomata updated the task description. (Show Details)

A low per-IP connection rate limit would be useful too, as it would keep someone from quickly opening and closing connections, and/or opening too many connections for themselves.

The link you posted indicates that the limit is 20 connections/second (with a burst of 1000?) per IP. That is a little high, but it is probably good enough if we also have a total open connection limit in EventStreams.

Oh, 20/sec is PER varnish host. Hm.
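
For the in-service side of this, a per-IP concurrent connection cap could look roughly like the following sketch (an in-process Map, so per worker only; the Express wiring, route, limit, and port are assumptions, not the actual EventStreams code). A fleet-wide cap would still need the shared DHT/Redis counter discussed above:

```
const express = require('express');

const MAX_PER_IP = 5;          // hypothetical per-IP cap
const connections = new Map(); // ip -> number of open SSE connections

const app = express();

app.get('/v2/stream/:stream', (req, res) => {
  const ip = req.ip;
  const open = connections.get(ip) || 0;
  if (open >= MAX_PER_IP) {
    res.status(429).end('Too many concurrent connections');
    return;
  }
  connections.set(ip, open + 1);

  res.setHeader('Content-Type', 'text/event-stream');
  // ... hook the Kafka consumer up to res here ...

  // Release the slot when the client goes away.
  req.on('close', () => {
    connections.set(ip, (connections.get(ip) || 1) - 1);
  });
});

app.listen(8092); // hypothetical port
```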

Change 439653 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Switch evenstreams to main kafka in deployment-prep

https://gerrit.wikimedia.org/r/439653

Change 439653 merged by Ottomata:
[operations/puppet@production] Switch evenstreams to main kafka in deployment-prep

https://gerrit.wikimedia.org/r/439653

Ottomata moved this task from Next Up to In Code Review on the Analytics-Kanban board.
Ottomata moved this task from In Code Review to In Progress on the Analytics-Kanban board.

Change 439996 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use Kafka main-eqiad for EventStreams service

https://gerrit.wikimedia.org/r/439996

Change 439996 merged by Ottomata:
[operations/puppet@production] Use Kafka main-eqiad for EventStreams service

https://gerrit.wikimedia.org/r/439996

Mentioned in SAL (#wikimedia-operations) [2018-06-14T15:39:39Z] <ottomata> switching EventStreams service to be backed by main-eqiad - T185225

Change 440354 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove unneeded broker.version.fallback for eventstreams

https://gerrit.wikimedia.org/r/440354

Change 440354 abandoned by Ottomata:
Remove unneeded broker.version.fallback for eventstreams

Reason:
wrong topic branch

https://gerrit.wikimedia.org/r/440354

Change 440355 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Remove unneeded broker.version.fallback for eventstreams

https://gerrit.wikimedia.org/r/440355

Change 440355 merged by Ottomata:
[operations/puppet@production] Remove unneeded broker.version.fallback for eventstreams

https://gerrit.wikimedia.org/r/440355

Please update the Wikitech docs about the awesome time-based consumption feature.
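
For the docs, something along these lines could illustrate time-based consumption (a hedged sketch: it assumes the public EventStreams endpoint accepts a `since` timestamp query parameter and uses the `eventsource` npm package):

```
const EventSource = require('eventsource');

// Resume the recentchange stream from a timestamp instead of a Kafka offset.
const since = '2018-06-14T00:00:00Z';
const url = `https://stream.wikimedia.org/v2/stream/recentchange?since=${since}`;

const es = new EventSource(url);

es.onmessage = (event) => {
  const change = JSON.parse(event.data);
  console.log(change.meta.dt, change.title);
};

es.onerror = (err) => {
  console.error('EventStreams error', err);
};
```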