Page MenuHomePhabricator

EventStreams drops the connection after 15 minutes, which makes it unreliable
Closed, DeclinedPublic

Description

I am using EventStreams feed /v2/stream/recentchange in DiscordWikiBot with the help of a C# library called EvtSource. Around the end of November 2019 it started to report an exception every 15 minutes (counted from the start of the connection to stream.wikimedia.org) from some unrecognised responses. I haven’t changed anything in the code around that time and nothing was changed in EvtSource either, having used stream.wikimedia.org since July 2019.

I tried to look what might’ve gone wrong, but I couldn’t find it myself, so I thought I could ask the developers what kind of response EventStreams could send every 15 minutes and from which side fixing this should be approached.

Links to code, if they are helpful:

Console logs from my bot (don’t tell much other than constant timing; 3-4 seconds is, obviously, the amount of time it spends waiting to retry):

[2020-01-14 14:25:23] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response.
   at System.Net.Http.HttpConnection.FillAsync()
   at System.Net.Http.HttpConnection.ChunkedEncodingReadStream.ReadAsyncCore(Memory`1 buffer, CancellationToken cancellationToken)
   at System.IO.StreamReader.ReadBufferAsync()
   at System.IO.StreamReader.ReadLineAsyncInternal()
   at EvtSource.EventSourceReader.ReaderAsync()
[2020-01-14 14:40:27] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>
[2020-01-14 14:55:30] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>
[2020-01-14 15:10:34] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>
[2020-01-14 15:25:37] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>

I was told in IRC that it is OK to ask this over Phabricator.

Event Timeline

stjn created this task.Jan 14 2020, 4:56 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 14 2020, 4:56 PM
stjn renamed this task from Unknown message from EventStreams goes unhandled to Unknown message from EventStreams causes errors in third-party library.Jan 14 2020, 4:57 PM
stjn updated the task description. (Show Details)Jan 14 2020, 5:02 PM

Maybe this might help, it’s every 20 seconds now, so.

stjn renamed this task from Unknown message from EventStreams causes errors in third-party library to EventStreams drops the connection after 15 minutes, which makes it unreliable.Mar 15 2020, 7:39 PM

We’ve checked it both with WireShark and via JavaScript EventSource API, and both times we got the same issue where the server drops the connection exactly after 15 minutes, so I changed the title of this task.

Hi! I'm sorry that this is the first time I'm seeing this task!

I'm surprised that this was happening back in November to you. We recently (this month) migrated the EventStreams service to Kubernetes, and along the way encountered an issue with some HTTPS termination logic doing exactly as you say; disconnecting after 15 minutes. This should be fixed now though. Additionally, we just found and solved a memory leak that certainly would have caused some disconnects, although I wouldn't expect them to be so often.

I'll be deploying the memory leak fix today, so can you check again tomorrow and verify if this is still happening?

Ottomata claimed this task.Mar 23 2020, 3:34 PM
Ottomata moved this task from Incoming to Event Platform on the Analytics board.
Ottomata added a project: Analytics-Kanban.
stjn added a comment.Mar 24 2020, 2:34 PM

Is it deployed already? Currently (24 March, 14:32 UTC) disconnects at the same frequency, though maybe I need to restart for it to take the effect.

Ottomata triaged this task as High priority.EditedMar 24 2020, 3:22 PM
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Yup it's deployed. I doubt you'd need to restart your client if the problem is indeed server side.

Interesting, I think I can reproduce with a simple curl. This does not happen from inside WMF infrastructure. And since it seems to be exactly 15 minutes, I'd guess there is a timeout somewhere in the HTTP(S) frontends on our side. I will investigate further.

Ottomata added a subscriber: ema.Mar 24 2020, 3:58 PM

Ah ha, @ema thinks it is https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#proxy-config-http-transaction-active-timeout-in. I believe it was around in November when we migrated to ATS for HTTPS termination. He is going to fix this for stream.wikimedia.org.

Ottomata reassigned this task from Ottomata to ema.Mar 24 2020, 4:25 PM
Ottomata added a project: Traffic.
Restricted Application added a project: Operations. · View Herald TranscriptMar 24 2020, 4:25 PM
ema moved this task from Triage to Caching on the Traffic board.Mar 25 2020, 9:08 AM

Change 583295 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable transaction_active_timeout_in for EventStreams

https://gerrit.wikimedia.org/r/583295

Change 583295 merged by Ema:
[operations/puppet@production] ATS: disable transaction_active_timeout_in for EventStreams

https://gerrit.wikimedia.org/r/583295

ema added a comment.Mar 25 2020, 4:04 PM

Change 583295 merged by Ema:
[operations/puppet@production] ATS: disable transaction_active_timeout_in for EventStreams

https://gerrit.wikimedia.org/r/583295

That was it, I've started a GET request for recentchanges more than 20 minutes ago and it's still up. @stjn: could you please check if your bot now works as expected? Thanks!

stjn closed this task as Resolved.Mar 26 2020, 3:19 PM

Frequent disconnects stopped after 25th March, 15:30 UTC, so yes. Thank you for resolving this!

ema reopened this task as Open.Mar 30 2020, 8:43 AM
ema added a subscriber: Vgutierrez.

The patch disabling transaction_active_timeout was reverted by @Vgutierrez (thanks!) as it caused issues over the weekend: T248736. This means that clients connected to EventStreams will again be disconnected every 15 minutes till we find an appropriate way to tackle the issue. Sorry @stjn!

stjn added a comment.EditedMar 30 2020, 9:40 PM

I was wondering why did this start coming up again :-) Alas, hope a solution can be figured up.

Change 593517 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disabling transaction_active_timeout_in for test eventstreams

https://gerrit.wikimedia.org/r/593517

Change 595484 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.7-1wm3

https://gerrit.wikimedia.org/r/595484

Change 595484 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.7-1wm3

https://gerrit.wikimedia.org/r/595484

Mentioned in SAL (#wikimedia-operations) [2020-05-11T10:15:10Z] <vgutierrez> upload trafficserver 8.0.7-1wm3 to apt.wm.o (buster) - T242767 T249335

Mentioned in SAL (#wikimedia-operations) [2020-05-12T06:56:54Z] <vgutierrez> upload trafficserver 8.0.7-1wm4 to apt.wm.o (buster) - T242767 T249335

Hio, I see some upgrades happened, should they have fixed this?

Heh, just tried myself, I guess not; still got disconnected after 15 minutes.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.Jun 3 2020, 4:55 PM

Hey, can be fixed this problem?

ema closed this task as Declined.Jun 8 2020, 1:36 PM

Hey, can be fixed this problem?

I've decided to decline this task in the interpretation "connections should never drop" -- keeping T248736 open instead, the bug we uncovered by disabling transaction_active_timeout_in: that is definitely a problem and should be fixed.

When it comes to this very bug instead, clients cannot assume that once a TCP connection to our servers is established it will keep on working forever. Either every 15 minutes, once an hour, or at random points in time, the connection can and will be closed by the server side. It is up to the client to detect when this happens and act accordingly, for example by reconnecting. Even not taking network failures into account, for operational reasons we do restart specific software components (upgrades, configuration changes requiring a restart). In certain cases we need to reboot servers, for instance due to kernel upgrades. In other scenarios we drain an entire data center and direct all clients elsewhere via DNS. In general, we expect that within minutes a certain DC can be drained entirely, which isn't true if we allow clients to stay connected indefinitely.

Hm, EventStreams uses the Server Sent Events for this very reason. I don't think anyone is expecting that the connection remain always open. I'd expect that it remain open as long as possible and not be manually disconnected by the server.

What happens right now if someone had to download a large file from I dunno, commons or dumps.wikimedia.org on a slow connection, such that the download takes more than 15 minutes? Would they be disconnected and never be able to download the file?

I think we should reopen...the server should not disconnect EventStreams clients every 15 minutes.

stjn added a comment.EditedJun 8 2020, 4:44 PM

This is a very strange conclusion to this task. There was never an assumption that you do not need to reconnect to EventStreams from my part, on the contrary, if I didn’t have the means to restore the connection, I would’ve stopped using EventStreams ASAP in November 2019 instead of sticking with it for such a long time. The problem with constant disconnects is that you can’t meaningfully debug any other issues that might arise (like T250912: EventStreams socket stays connected without any traffic incoming) because EventStreams are so fragile for any user. And, as that other task shows, you can’t even rely on servers closing the connection right now, because sometimes they don’t.

BBlack added a subscriber: BBlack.Jun 8 2020, 4:59 PM

[reordering a little]

What happens right now if someone had to download a large file from I dunno, commons or dumps.wikimedia.org on a slow connection, such that the download takes more than 15 minutes? Would they be disconnected and never be able to download the file?

Large downloads are very different than idle and/or server-push connections. Clients typically handle this scenario by using chunked transfer encoding and saving partial data and using Range requests to break it up into smaller units and/or resume.

Hm, EventStreams uses the Server Sent Events for this very reason. I don't think anyone is expecting that the connection remain always open. I'd expect that it remain open as long as possible and not be manually disconnected by the server.
I think we should reopen...the server should not disconnect EventStreams clients every 15 minutes.

I think this is a tricky issue that's almost more philosophical than purely technical. It's a given that no network connection can actually persist indefinitely at Layer 4 and below, and that termination isn't even cleanly predictable in many cases.. Robust protocols and clients have to handle this gracefully anyways: reconnect on close with some sane logic at Layer 4, and also possibly some Layer 7 mechanisms for picking up where you left off or identifying transaction-level issues to avoid replays or missed data. For an event stream system in general, I'd imagine some kind of serial number marker on events that allows a reconnecting client to specify when (in the reasonable past) it would like to begin replaying from. As @ema pointed out, there's all kinds of legitimate reasons we'd cause disconnects at the edge layer, but mostly site depools, server depools, software restarts, etc (not to mention true unpredicted failures - crashes, link flaps, etc).

We do need timeouts on the server side as a protective measure against "leaked" dead sockets that consume resources and other such scenarios, and for those purposes, the shorter the better. Clients also don't want to be constantly reconnecting for a persistent push stream, either. From that perspective, the purpose of having a long-running persistent connection is mostly efficiency. The worst-case version of this would be one event per request and one request per connection, meaning the client has to re-establish a fresh connection for each received event. That would obviously be pretty inefficient for all parties involved. But even at a value like 15 minutes, that's a ton of amortization for the connection startup cost, with remaining efficiency gains from longer timeouts being pretty minimal.

I think there probably is some legitimate bikeshedding to do here about the exact value of the timeout.

This also bleeds into recent SRE-level discussions on how we should approach SLO in general. There's value in trying to actually hit an intended SLO target range and not overshoot it. For example (entirely made up): if we had an SLO for the duration of a persistent Evenstream connection before server-side termination set at 10 minutes, but in practice we were massively exceeding the target and averaging 9 hours before a connection fails for some reason, that's a problem. It's a problem because consumers will come to rely on the un-gauranteed 9 hours and begin building false assumptions around it, which leads to fragile design. From this sort of perspective, I really wish we had the tooling to randomize this timeout over a range, so that there's less of a predictable pattern to build any assumptions around. We may eventually get there, indirectly, if we have some ChaosMonkey -like agent randomly shooting down various pieces of edge infrastructure in the future.

The nature of the original task complains about a network behavior rather than a software bug. The network is and should be inherently unreliable, so it's not really legitimate to file a bug asking that network connections last for some longer indefinite amount of time to address what is likely a Layer 7 protocol issue or a client software bug.

Wrapping this up in some more-concrete directions, I'd ask:

  1. What do you think a reasonable timeout value would be (either just from the EventStream perspective, or also considering the edge's need for cleanliness)?
  1. Does the typical client handle the disconnect gracefully, or are some events lost?
  1. If events are typically lost, is this because of protocol/design limitations (e.g. lack of a parameter to replay recent-past events?), or is it just a client bug?
Ottomata added a comment.EditedJun 8 2020, 5:23 PM

(Thanks for the response bblack!)

  1. Does the typical client handle the disconnect gracefully, or are some events lost?

The typical SSE client (AKA EventSource) should reconnect at the last consumed offset (or timestamp, more correctly) that it saw. Events should not be lost due to disconnect. This is done via the SSE Last-Event-ID header. KafkaSSE expects the needed Kafka subscription information in Last-Event-ID.

What do you think a reasonable timeout value would be (either just from the EventStream perspective, or also considering the edge's need for cleanliness)?

I guess I'd like to hear from the EventStreams users on this ticket about this, but I'd hope the connection would last for at least several hours, or maybe an hour at minimum.

CDanis added a subscriber: CDanis.Jun 8 2020, 5:34 PM
stjn added a comment.Jun 8 2020, 9:25 PM

I guess I'd like to hear from the EventStreams users on this ticket about this, but I'd hope the connection would last for at least several hours, or maybe an hour at minimum.

I can only say that I agree. As I said above, the issue never was that EventStreams can drop the connection, the issue was that it happens so often that it is hard to work with it and to know what is attributable to the constant disconnects and what is attributable to other issues with EventStreams.

BBlack added a comment.Jun 9 2020, 2:05 PM

Looking at that other ticket T250912 - would an in-band service ping or NOP event of some kind address detecting "hung" connections at the application layer? Does SSE have any kind of null requests or null events that could be sent for that purpose?

What's the rationale behind any given time choice, though (Why would 15 vs 60 minutes disconnects make any difference to debugging some unrelated issue)? If anything it would seem like more-frequent disconnects would limit the impact of such an issue, by reducing the amount of time that could pass with no events being sent on an applayer-hung connection.

Hm, I'm pretty sure the connection is terminated even when there are events being sent.

 time curl https://stream.wikimedia.org/v2/stream/recentchange > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43.7M    0 43.7M    0     0  50978      0 --:--:--  0:15:00 --:--:-- 49658
curl: (18) transfer closed with outstanding read data remaining

real	15m0.307s
user	0m3.242s
sys	0m3.570s

I never saw curl report that it wasn't receiving events. If the connection is timed out when there is no activity, that's one thing, but I think this ticket is about the server closing the connection even while there is activity.

ema added a comment.Jun 18 2020, 3:42 PM

Hm, I'm pretty sure the connection is terminated even when there are events being sent.

Definitely. It seems that @stjn finds it complicated to distinguish between disconnects issued by the CDN and problems related to EventStreams:

the issue never was that EventStreams can drop the connection, the issue was that it happens so often that it is hard to work with it and to know what is attributable to the constant disconnects and what is attributable to other issues with EventStreams.

That's why (I think) @BBlack was proposing of adding some sort of in-band service ping to eventstreams: to distinguish between the two (CDN disconnects vs eventstreams issues). Also:

What's the rationale behind any given time choice, though (Why would 15 vs 60 minutes disconnects make any difference to debugging some unrelated issue)? If anything it would seem like more-frequent disconnects would limit the impact of such an issue, by reducing the amount of time that could pass with no events being sent on an applayer-hung connection.

The CDN will disconnect in any case at some point for the various reasons mentioned in this ticket. We are wondering why 15 minutes vs 60 minutes vs once a day makes any difference. T250912 seems to indicate that the crux of the problem here is that the client does not manage to reconnect reliably, and that it stops receiving new events on a stream while it is connected (in ESTABLISHED state). So: we can increase the timeout to, I don't know, one hour, but that's not going to fix anything. It's just going to make T250912 harder to reproduce.

Just parking a crazy idea I just had, mostly irrelevant to this ticket.

Large downloads are very different than idle and/or server-push connections. Clients typically handle this scenario by using chunked transfer encoding and saving partial data and using Range requests to break it up into smaller units and/or resume.

Why are we using SSE at all if HTTP has a built in header for resuming a download? We could treat a stream as an unending download, and encode the same information that is currently in Last-Event-ID into the Range header. Then any http client that supports multi-part chunked downloads could potentially automatically reconnect, and users wouldn't have to deal with the (often poor quality) 3rd party SSE / EventSource clients out there.

We could treat a stream as an unending download, and encode the same information that is currently in Last-Event-ID into the Range header.

Ah, while this could work, clients would need to know how to keep track of their positions in the streams to provide in the Range header in case they disconnect. This is exactly what SSE/EventSource and Last-Event-ID were built for. Ok bad idea! :)

Some thoughts: I agree with @ema and @BBlack that we cannot expect connections to live "forever" and it seems that is the paradigm we are operating from when using SSE.

I think ultimately what we what to do is not very different from what is done in a chat application or similar where the client has what "seems" like a persistent connection to the chatserver but that in reality is not such. It is just an http long poll or similar where the client handles "state" (offset, in our case) . I think an HTTP long poll is a better analogy than an HTTP download of a file cause a download finishes, a long poll does not.

Then, on my opinion, makes sense to close this ticket, SSE clients are going to have to live with a disconnect that will happen either every 15 minutes or every hour, this has the implication that unless the endpoint allows consumption via an offset we cannot guarantee a client will have access to a 100% reliable stream of data.

SD0001 added a subscriber: SD0001.Sat, Nov 21, 9:10 AM