At 07:30 UTC on Jan 31, 2019, I noticed that EventStreams had stopped working: none of the services using it seemed to work.
Jynus (@jcrespo) on IRC stopped and started it again, and things came back, but he suggested I still file a bug report since the underlying issue may not be solved. Thanks.
@jcrespo: More info: no alert was produced, and all eventstream-related services were green ( https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=eventstream ) even though the service may have been malfunctioning for 2 days. In fact, "curl https://stream.wikimedia.org/v2/stream/recentchange" worked fine from within the production network. From the outside, however, I got a 502 error, potentially caused by the following message: (MAX_CONCURRENT_STREAMS == 128)
It seems like too many clients were connected at the same time, an issue happening since the 26th: https://grafana.wikimedia.org/d/000000336/eventstreams?refresh=1m&orgId=1&from=1548329837870&to=1548934637871&panelId=1&fullscreen&var-stream=All&var-topic=All&var-scb_host=All
A restart of the eventstream service was done, which likely forced clients to disconnect; the number of connected clients went down to 40 and the service started working again.
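A threshold check on the concurrent client count might have flagged this before the service became unreachable. The following is only a minimal sketch, assuming the 128 in the message above is the effective client limit; the function name `classify_client_count` and the 80%/95% thresholds are hypothetical, and fetching the live count from the metrics backend is left out. The return codes follow the usual Nagios/Icinga plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
# Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify_client_count(connected, limit=128, warn_ratio=0.8, crit_ratio=0.95):
    """Return (exit_code, message) for a concurrent-client count.

    limit=128 is taken from the MAX_CONCURRENT_STREAMS message in this report;
    warn_ratio and crit_ratio are assumed thresholds, not production values.
    """
    summary = f"{connected}/{limit} concurrent clients"
    if connected >= limit * crit_ratio:
        return CRITICAL, f"CRITICAL: {summary}"
    if connected >= limit * warn_ratio:
        return WARNING, f"WARNING: {summary}"
    return OK, f"OK: {summary}"
```

With these assumed thresholds, the post-restart count of 40 clients would be classified as OK, while a count at the limit of 128 would be CRITICAL.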
There are 2 issues here:
- Research why the overload happens and mitigate it (e.g. limit the number of connections per IP, disconnect idle/bad clients, increase the limit, etc.)
- Improve monitoring so that alerts happen in the above case (client overload)
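
As an illustration of the per-IP mitigation mentioned in the first item, here is a minimal sketch of an in-memory connection counter. It is not how EventStreams is implemented; the class name `PerIPConnectionLimiter` and the default cap of 4 connections per IP are assumptions made for the example, and a real deployment would also need to account for clients behind shared proxies or NAT.

```python
class PerIPConnectionLimiter:
    """Track open connections per client IP and refuse new ones over a cap."""

    def __init__(self, max_per_ip=4):
        # max_per_ip is a hypothetical cap, not a recommended production value.
        self.max_per_ip = max_per_ip
        self._open = {}

    def try_acquire(self, ip):
        """Register a new connection; return False if the IP is at its cap."""
        current = self._open.get(ip, 0)
        if current >= self.max_per_ip:
            return False
        self._open[ip] = current + 1
        return True

    def release(self, ip):
        """Unregister a connection when the client disconnects."""
        current = self._open.get(ip, 0)
        if current <= 1:
            # Drop the entry so idle IPs do not accumulate in memory.
            self._open.pop(ip, None)
        else:
            self._open[ip] = current - 1
```

A server would call `try_acquire` when a client opens a stream (rejecting the request if it returns False) and `release` when the stream closes, so a single misbehaving client cannot use up all available connection slots.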