EventStreams socket stays connected without any traffic incoming
Open, High, Public

Description

Through my testing, I’ve noticed that my bot regularly gets into a state where it stops receiving new events on a stream while the connection stays open (in ESTABLISHED state) with no traffic coming through. This usually does not happen right away, but after several reconnection attempts caused by planned connection disruption (see T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable / T248736: ats-tls ran out of FDs on cp1089). I am wondering whether this is the same problem as the one in the other tasks or something different.

I am trying to code workarounds for the existing issues of EventStreams, since I don’t want to restart the bot virtually every day, but this issue is hindering my ability to detect that things are wrong in the first place.
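For illustration, one way to notice this kind of silent stall from the client side is to record when the last event arrived and have a periodic check compare that against a maximum idle time. Below is a minimal sketch of that idea in Java (the language of the client snippet later in this thread). `StallDetector`, `recordEvent` and `isStalled` are hypothetical names, not part of EvtSource or EventStreams, and any threshold is an assumption that depends on how busy the subscribed stream is.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical helper: remembers when the last event arrived and reports a stall. */
class StallDetector {
    private final long maxIdleNanos;
    private final LongSupplier nanoClock;
    private final AtomicLong lastEventNanos;

    /** The clock is injectable so the detector can be exercised without sleeping. */
    StallDetector(Duration maxIdle, LongSupplier nanoClock) {
        this.maxIdleNanos = maxIdle.toNanos();
        this.nanoClock = nanoClock;
        this.lastEventNanos = new AtomicLong(nanoClock.getAsLong());
    }

    StallDetector(Duration maxIdle) {
        this(maxIdle, System::nanoTime);
    }

    /** Call this from the event handler every time something arrives on the stream. */
    void recordEvent() {
        lastEventNanos.set(nanoClock.getAsLong());
    }

    /** True if nothing has arrived for longer than the configured idle time. */
    boolean isStalled() {
        return nanoClock.getAsLong() - lastEventNanos.get() > maxIdleNanos;
    }
}
```

A client could call `recordEvent()` from its message handler and check `isStalled()` from a timer, forcing a reconnect when it returns true. This only detects the symptom; it says nothing about whether the cause is on the client or the server side.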

Off the top of my head, a similar thing has already been described in T179986: Investigate why current es2r daemon is randomly hanging, but I would like someone to confirm whether this is the same issue.

Event Timeline

Milimetric subscribed.

Thanks for the report. Can you please link us to the client code? Is this a Python bot? How often does it happen? Every time after X minutes with no events? Or just sometimes?

Sorry for not answering this sooner. I am running a Discord bot for Wikimedia Discord servers that is written in C# (all bot-specific code is in the EventStreams.cs file). It uses the C# library EvtSource to read from streams, which does most of the legwork. It is hard for me to say how often it happens since there are too many disconnects right now as it is, but sometimes it happens every day, sometimes once in three or five days. Two weeks ago it was basically once a day, for example.

Hm, this is going to be hard to reproduce, especially given T179986. I'd expect this to be a client side problem, but it could be something on EventStreams side. Can we wait until T179986 is resolved before investigating further? Perhaps if that is fixed your problem will just disappear! :)

Did you mean waiting for T242767: EventStreams drops the connection after 15 minutes, which makes it unreliable to be resolved, not T179986: Investigate why current es2r daemon is randomly hanging? Either way, I mostly logged this so I won’t forget that such a problem exists. Of course, I am more than happy to wait.

Oops, I did mean what you said. THANK YOU

Hello, I believe I may be having a similar issue.

We built a Java tool that consumes the event stream, and we also occasionally see incoming events stalling. Sometimes the stall is intermittent, sometimes it lasts for the time that is left before the 15-minute connection timeout. A stall can happen anywhere in the 15-minute span.

Luckily, after every 15-minute reconnect, I am getting events again as usual.

To note: I observe this while still ingesting historic data that is a few days old, so there are enough events still ahead. And I use Apache HttpClient for networking.

Edit: I also just noticed that after two long stalls, just prior to the connection timeout, a bunch of events were received.

Java code ingesting the events:
```
do {
  HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(getStreamUrl()))
        .header("Accept", "application/json")
        .build();

  HttpResponse<InputStream> response = client.send(request, BodyHandlers.ofInputStream());
  InputStreamReader streamReader = new InputStreamReader(response.body(), StandardCharsets.UTF_8);
  BufferedReader reader = new BufferedReader(streamReader);

  try {
    for (String line; (line = reader.readLine()) != null;) {
      numProcessed++;
      ....
```
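A blocking `readLine()` loop like the one above simply waits while the connection is open but silent, so by itself it cannot distinguish a stall from a quiet stream. A minimal sketch of one way to make the stall visible, using only the JDK: read lines on a background thread and let the consumer wait with a timeout. `TimedLineReader` and `nextLine` are hypothetical names, and the timeout value is an assumption to be tuned per stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: turns a blocking line reader into one that can time out. */
class TimedLineReader {
    // Optional.empty() marks the end of the stream; the queue is unbounded,
    // which is acceptable for a sketch but not for a slow consumer.
    private final BlockingQueue<Optional<String>> lines = new LinkedBlockingQueue<>();

    TimedLineReader(InputStream in) {
        Thread reader = new Thread(() -> {
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.UTF_8))) {
                for (String line; (line = r.readLine()) != null;) {
                    lines.add(Optional.of(line));
                }
            } catch (IOException e) {
                // In this sketch a read error is treated like the end of the stream.
            } finally {
                lines.add(Optional.empty());
            }
        }, "stream-reader");
        reader.setDaemon(true); // a stuck read must not keep the JVM alive
        reader.start();
    }

    /**
     * Returns the next line, or null once the stream has ended.
     * Throws TimeoutException if nothing arrives within the given time.
     */
    String nextLine(long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        Optional<String> item = lines.poll(timeout, unit);
        if (item == null) {
            throw new TimeoutException("no data for " + timeout + " " + unit);
        }
        return item.orElse(null);
    }
}
```

On a `TimeoutException` the caller would typically close the response body and reconnect rather than keep waiting. The background thread may stay blocked in its read until the underlying stream is closed, which is why it is marked as a daemon here.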

I believe it is solved for my case. I ditched Java's internal HTTP handling and replaced it with our internal utility that uses Apache HttpClient. Up until yesterday almost any other 15-minute cycle would stall somewhere, but never the first. Stalls had also become more frequent.

With Apache HttpClient, the stream is being processed for almost two hours now without any stall. I am happy!

I too think I've been facing this issue: it happened twice over the past 3 days, though IIRC those have been the only two occurrences this year. The bot's process remains active, but the onopen and onerror event listeners catch nothing.