
EventStreams returns 502 errors from outside the WMF network
Closed, ResolvedPublic3 Estimate Story Points


At 07:30 UTC on Jan 31, 2019, I noticed that EventStreams had stopped working: none of the services consuming it appeared to be functioning.

Jynus (@jcrespo) on IRC restarted it and things came back, but he suggested I still file a bug report since the underlying issue may not be solved. Thanks.

@jcrespo: More info: no alert was produced and all EventStreams-related services showed green, even though the service may have been malfunctioning for 2 days; in fact, "curl" worked fine from within the production network. From the outside, however, I got a 502 error, potentially caused by the following message: (MAX_CONCURRENT_STREAMS == 128)
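For context on why concurrent connections pile up against a limit like this: every EventStreams consumer holds a single long-lived SSE (Server-Sent Events) HTTP connection open for as long as it wants events. A minimal sketch of the consumer side (the parser is real SSE wire format; the streaming loop is illustrative and assumes network access to the public endpoint):

```python
# Minimal sketch of an EventStreams (SSE) consumer. Each client keeps one
# long-lived HTTP connection open, which is why concurrent connections
# accumulate server-side until a limit such as MAX_CONCURRENT_STREAMS is hit.
import json

def parse_sse_block(block: str) -> dict:
    """Parse one SSE event block ('field: value' lines) into a dict."""
    event = {}
    for line in block.splitlines():
        if not line or line.startswith(":"):  # skip comments/keepalives
            continue
        field, _, value = line.partition(":")
        event[field] = value.lstrip()
    return event

# Illustrative streaming loop (requires network; not run here):
# import urllib.request
# with urllib.request.urlopen(
#         "https://stream.wikimedia.org/v2/stream/recentchange") as resp:
#     ...  # read lines, split on blank lines, feed blocks to parse_sse_block

block = "event: message\ndata: {\"wiki\": \"enwiki\"}"
evt = parse_sse_block(block)
print(evt["event"], json.loads(evt["data"])["wiki"])  # -> message enwiki
```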

It seems too many clients were connected at the same time, an issue that had been occurring since the 26th:

Restarting the eventstreams service likely forced a disconnection; clients went down to 40 and the service started working again.

There are 2 issues here:

  • Research why the overload happens and mitigate it (e.g. limit the number of connections per IP, disconnect idle or misbehaving clients, or raise the limit)
  • Improve monitoring so that an alert fires in the above case (client overload)
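The first mitigation (a per-IP connection cap) could look roughly like the sketch below. All names here (ConnectionLimiter, MAX_PER_IP) are hypothetical illustrations, not EventStreams' actual implementation:

```python
# Hypothetical per-IP connection limiter: reject new streams from an IP
# that already holds the maximum number of concurrent connections.
from collections import defaultdict

MAX_PER_IP = 2  # assumed cap; the real service would tune this

class ConnectionLimiter:
    def __init__(self, max_per_ip: int = MAX_PER_IP):
        self.max_per_ip = max_per_ip
        self.active = defaultdict(int)  # ip -> open connection count

    def try_connect(self, ip: str) -> bool:
        """Register the connection and return True iff under the cap."""
        if self.active[ip] >= self.max_per_ip:
            return False  # here the server would answer 429 (or similar)
        self.active[ip] += 1
        return True

    def disconnect(self, ip: str) -> None:
        if self.active[ip] > 0:
            self.active[ip] -= 1

limiter = ConnectionLimiter()
print(limiter.try_connect("10.0.0.1"))  # True
print(limiter.try_connect("10.0.0.1"))  # True
print(limiter.try_connect("10.0.0.1"))  # False: cap reached
```

Disconnecting idle clients would need a similar bookkeeping structure keyed on last-activity timestamps.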


Related Gerrit Patches:

Event Timeline

Restricted Application added a subscriber: Aklapper. Jan 31 2019, 11:34 AM
jcrespo renamed this task from EventStreams stopped working to EventStreams returns 502 errors from outside the WMF network.Jan 31 2019, 11:43 AM
jcrespo triaged this task as High priority.
jcrespo updated the task description. (Show Details)
jcrespo added subscribers: jcrespo, Ottomata, elukey.
jcrespo updated the task description. (Show Details)Jan 31 2019, 11:50 AM

CC @akosiaris as this is happening on scb servers.

Wow that is a lot more clients than we've ever had! @jcrespo thanks for bouncing the service. Where/how did you see this MAX_CONCURRENT_STREAMS == 128 error?

It was the only information, other than the return status, in the headers or the content returned. The error only happened outside of the internal network, so I guess it could also be traffic-related rather than a service problem? Apparently stats came back to normal over the last few days:

I am more worried about an outage effectively going undetected by monitoring than about fixing any root cause; with congestion, there is only so much one can do.

fdans added a subscriber: fdans.Feb 7 2019, 6:16 PM

@Ottomata are there any actionables for us here?

Yes, at the least we need to make the aliveness check somehow run from outside the production networks.
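Such a check only passes if the public endpoint both returns HTTP 200 and actually delivers an event within a timeout, so a 502-from-outside condition like this incident would alert. A rough sketch (the URL, thresholds, and function names are assumptions, not the actual check merged later):

```python
# Hypothetical external aliveness check for EventStreams, Nagios-style:
# probe the *public* endpoint, then map the result to a check state.

def evaluate(status: int, got_event: bool) -> str:
    """Map probe results to a Nagios-style state string."""
    if status != 200:
        return "CRITICAL: HTTP %d from public endpoint" % status
    if not got_event:
        return "CRITICAL: connected but no event within timeout"
    return "OK: public endpoint is serving events"

# Illustrative probe (requires network; not run here):
# import urllib.request
# req = urllib.request.urlopen(
#     "https://stream.wikimedia.org/v2/stream/recentchange", timeout=10)
# status = req.status
# got_event = any(line.startswith(b"data:") for line in req)

print(evaluate(502, False))  # CRITICAL: HTTP 502 from public endpoint
print(evaluate(200, True))   # OK: public endpoint is serving events
```

The key design point is that an internal curl succeeding (as during this incident) would not satisfy this check, because the probe goes through the same path external clients use.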

Nuria assigned this task to Ottomata.Feb 11 2019, 4:51 PM
Nuria added a project: Analytics-Kanban.
Nuria moved this task from Incoming to Operational Excellence on the Analytics board.

Change 492199 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Monitor public endpoint

We believe this happened on purpose!

Look at the past me doing a great job:
T196553: Support connection/rate limiting in EventStreams

We just need better monitoring for when this happens. The patch above (^) adds that.

Change 492199 merged by Ottomata:
[operations/puppet@production] Monitor public endpoint

Change 492349 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Use nagios_common::check_command::config to define check_eventstreams

Change 492349 merged by Ottomata:
[operations/puppet@production] Use nagios_common::check_command::config to define check_eventstreams

Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.Feb 22 2019, 5:31 PM
Nuria closed this task as Resolved.Feb 25 2019, 10:38 PM
Nuria set the point value for this task to 3.