
Support connection/rate limiting in EventStreams
Closed, ResolvedPublic5 Estimated Story Points

Description

We want to limit the total number of EventStreams connections to Kafka. To do this, we really just need to limit the total number of concurrent EventStreams connections. This is harder than it sounds, since EventStreams connections are long-lived HTTP connections that are never closed. To enforce a true global limit, we'd need to keep a distributed counter somewhere, perhaps Redis. However, there doesn't seem to be an easy, consistent way to reliably decrement the counter: a SIGKILL to the EventStreams process would cause the process to die without decrementing it.

Another, less precise idea would be to use a non-distributed counter somewhere. Varnish already supports a per-backend max_connections; the backend in this case is the eventstreams.svc.$dc.wmnet URL. So we could do the limiting at the varnish instance level. Depending on how requests are hashed to varnish backends, some requests might hit the limit on one varnish instance while plenty of slots remain open on others.
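For illustration, a per-backend cap in VCL might look roughly like this (the backend name, port, and exact layout here are assumptions for the sketch; the real setting lives in puppet):

```vcl
backend eventstreams {
    .host = "eventstreams.svc.eqiad.wmnet";
    .port = "8092";
    # Once this varnish instance has 25 open connections to the backend,
    # further requests to it fail rather than piling onto Kafka.
    .max_connections = 25;
}
```

Note the cap is per varnish instance, not global, which is exactly the imprecision described above.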

We actually have the total number of connections (per stream) in statsd/graphite, but that works because statsd aggregates the counts for us as a gauge. We certainly aren't going to query graphite for this, but maybe there is some way to use an expiring gauge in Redis? Or perhaps we can reuse https://github.com/wikimedia/limitation with the same interval counter approach we use for statsd?
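To make the expiring-gauge idea concrete, here is an illustrative in-memory sketch (the class name, TTL, and heartbeat API are all hypothetical; a real version would use per-connection Redis keys with EXPIRE so the count is shared across service instances). Each open connection heartbeats periodically; a SIGKILLed process simply stops heartbeating, so its slots age out instead of leaking forever:

```python
import time

class ExpiringGauge:
    """Approximate counter of concurrent connections.

    Entries that miss their TTL are dropped on the next count(),
    so no explicit decrement is ever required.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.last_seen = {}  # connection id -> last heartbeat timestamp

    def heartbeat(self, conn_id, now=None):
        # Called periodically by each live connection.
        self.last_seen[conn_id] = now if now is not None else time.time()

    def count(self, now=None):
        # Drop entries whose heartbeat has expired, then count the rest.
        now = now if now is not None else time.time()
        self.last_seen = {c: t for c, t in self.last_seen.items()
                          if now - t < self.ttl}
        return len(self.last_seen)
```

The trade-off is the one noted below: every connection must send periodic pings, and the count lags reality by up to one TTL.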

Any other ideas?

Event Timeline

Ottomata triaged this task as Medium priority.Jun 6 2018, 1:47 PM
Ottomata created this task.
Ottomata edited subscribers, added: ema; removed: Platonides, BBlack.

Or, perhaps we can reuse https://github.com/wikimedia/limitation with the same interval counter stuff we use in statsd?

To me this sounds like the simplest solution of all since we don't need any new external dependencies and the algorithm seems to be the easiest.

That won't be super efficient though, because we'd need periodic pings from each connection, as well as some additional UDP packets for DHT synchronization.

To me this sounds like the simplest solution of all since we don't need any new external dependencies and the algorithm seems to be the easiest.

Ideally we would throttle before the request even reaches the service itself, so as to deflect it as early as possible, right? That is why I was thinking varnish *might be* a better option.

The varnish option could be complementary, but it won't really be sufficient. We aren't really worried about the service itself becoming overloaded; we just want to protect Kafka. To protect it with confidence, we need some way of at least approximating the total number of EventStreams Kafka consumer connections.

fdans raised the priority of this task from Medium to High.Jun 11 2018, 4:10 PM
fdans moved this task from Incoming to Operational Excellence on the Analytics board.

We had a meeting today to discuss this, and decided that varnish-instance-level connection limiting is enough. We plan to limit connections per varnish instance. misc is moving into text, and there are 8 text caches per DC.

I just tested in deployment-prep and with EventStreams in production on Kafka analytics-eqiad. Kafka did not blink. In prod, I started seeing node process heap errors at around 300 concurrent connections to eventstreams.svc.eqiad.wmnet. I got 502 errors when I tried to go above 100 connections when connecting to nginx at https://stream.wikimedia.org. I'm not sure why nginx was failing me, but perhaps traffic has some per-IP throttling at 100 connections there too? Anyway, if connections die at 100 there, a greater limit for the varnish backends won't hurt.

So, I'd say we are doing pretty well. My tests consumed as much as possible from revision-create, i.e. the last 7 days of data:

ab -H 'Last-Event-ID: [{"topic": "eqiad.mediawiki.revision-create", "partition": 0, "offset": 794889001}]' -v 2 -c 300 -t 10 http://eventstreams.svc.eqiad.wmnet:8092/v2/stream/revision-create

I suggest we set per-varnish max_connections to 25. This covers the rare (never?) case where our average connection count is around 25 and (almost) all connections go through a single varnish, AND supports us up to 200 concurrent connections with 8 varnish servers.
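The capacity math behind that suggestion, sketched out (server counts other than 8 are hypothetical, just to show how the ceiling scales with the pool size):

```python
# Per-instance cap proposed for the varnish backend definition.
PER_INSTANCE_MAX = 25

def total_cap(n_varnish_servers):
    """Worst-case total backend connections: every varnish
    instance independently fills its own max_connections."""
    return PER_INSTANCE_MAX * n_varnish_servers

# With 8 text caches per DC, the effective ceiling is 200 connections.
print(total_cap(8))
```

Note the total grows automatically whenever servers are added to the pool, which is the caveat raised below.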

Change 439772 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Set eventstreams max_connections to 25 per varnish instance

https://gerrit.wikimedia.org/r/439772

AND support us up to 200 concurrent connections with 8 varnish servers.

Not a concern now, but something to keep in mind: if we add more varnish servers to this pool, we are also bumping up the total number of connections allowed to the backend.

Change 439911 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: avoid consistent hashing for pipe traffic

https://gerrit.wikimedia.org/r/439911

Change 439911 merged by Ema:
[operations/puppet@production] vcl: avoid consistent hashing for pipe traffic

https://gerrit.wikimedia.org/r/439911

Change 439929 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] vcl: properly choose backend in vcl_pipe

https://gerrit.wikimedia.org/r/439929

Change 439929 merged by Ema:
[operations/puppet@production] vcl: properly choose backend in vcl_pipe

https://gerrit.wikimedia.org/r/439929

Change 439772 merged by Ottomata:
[operations/puppet@production] Set eventstreams max_connections to 25 per varnish instance

https://gerrit.wikimedia.org/r/439772
