Page MenuHomePhabricator

EventStreams drops the connection after 15 minutes, which makes it unreliable
Open, HighPublic

Description

I am using EventStreams feed /v2/stream/recentchange in DiscordWikiBot with the help of a C# library called EvtSource. Around the end of November 2019 it started to report an exception every 15 minutes (counted from the start of the connection to stream.wikimedia.org) from some unrecognised responses. I haven’t changed anything in the code around that time and nothing was changed in EvtSource either, having used stream.wikimedia.org since July 2019.

I tried to look what might’ve gone wrong, but I couldn’t find it myself, so I thought I could ask the developers what kind of response EventStreams could send every 15 minutes and from which side fixing this should be approached.

Links to code, if they are helpful:

Console logs from my bot (don’t tell much other than constant timing; 3-4 seconds is, obviously, the amount of time it spends waiting to retry):

[2020-01-14 14:25:23] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response.
   at System.Net.Http.HttpConnection.FillAsync()
   at System.Net.Http.HttpConnection.ChunkedEncodingReadStream.ReadAsyncCore(Memory`1 buffer, CancellationToken cancellationToken)
   at System.IO.StreamReader.ReadBufferAsync()
   at System.IO.StreamReader.ReadLineAsyncInternal()
   at EvtSource.EventSourceReader.ReaderAsync()
[2020-01-14 14:40:27] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>
[2020-01-14 14:55:30] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>
[2020-01-14 15:10:34] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>
[2020-01-14 15:25:37] [EventStreams] [Warning] Stream returned the following exception (retry in 3000): System.IO.IOException: The server returned an invalid or unrecognized response. <…>

I was told in IRC that it is OK to ask this over Phabricator.

Event Timeline

stjn created this task.Jan 14 2020, 4:56 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJan 14 2020, 4:56 PM
stjn renamed this task from Unknown message from EventStreams goes unhandled to Unknown message from EventStreams causes errors in third-party library.Jan 14 2020, 4:57 PM
stjn updated the task description. (Show Details)Jan 14 2020, 5:02 PM

Maybe this might help, it’s every 20 seconds now, so.

stjn renamed this task from Unknown message from EventStreams causes errors in third-party library to EventStreams drops the connection after 15 minutes, which makes it unreliable.Mar 15 2020, 7:39 PM

We’ve checked it both with WireShark and via JavaScript EventSource API, and both times we got the same issue where the server drops the connection exactly after 15 minutes, so I changed the title of this task.

Hi! I'm sorry that this is the first time I'm seeing this task!

I'm surprised that this was happening back in November to you. We recently (this month) migrated the EventStreams service to Kubernetes, and along the way encountered an issue with some HTTPS termination logic doing exactly as you say; disconnecting after 15 minutes. This should be fixed now though. Additionally, we just found and solved a memory leak that certainly would have caused some disconnects, although I wouldn't expect them to be so often.

I'll be deploying the memory leak fix today, so can you check again tomorrow and verify if this is still happening?

Ottomata claimed this task.Mar 23 2020, 3:34 PM
Ottomata moved this task from Incoming to Event Platform on the Analytics board.
Ottomata added a project: Analytics-Kanban.
stjn added a comment.Mar 24 2020, 2:34 PM

Is it deployed already? Currently (24 March, 14:32 UTC) disconnects at the same frequency, though maybe I need to restart for it to take the effect.

Ottomata triaged this task as High priority.EditedMar 24 2020, 3:22 PM
Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board.

Yup it's deployed. I doubt you'd need to restart your client if the problem is indeed server side.

Interesting, I think I can reproduce with a simple curl. This does not happen from inside WMF infrastructure. And since it seems to be exactly 15 minutes, I'd guess there is a timeout somewhere in the HTTP(S) frontends on our side. I will investigate further.

Ottomata added a subscriber: ema.Mar 24 2020, 3:58 PM

Ah ha, @ema thinks it is https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#proxy-config-http-transaction-active-timeout-in. I believe it was around in November when we migrated to ATS for HTTPS termination. He is going to fix this for stream.wikimedia.org.

Ottomata reassigned this task from Ottomata to ema.Mar 24 2020, 4:25 PM
Ottomata added a project: Traffic.
Restricted Application added a project: Operations. · View Herald TranscriptMar 24 2020, 4:25 PM
ema moved this task from Triage to Caching on the Traffic board.Mar 25 2020, 9:08 AM

Change 583295 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disable transaction_active_timeout_in for EventStreams

https://gerrit.wikimedia.org/r/583295

Change 583295 merged by Ema:
[operations/puppet@production] ATS: disable transaction_active_timeout_in for EventStreams

https://gerrit.wikimedia.org/r/583295

ema added a comment.Mar 25 2020, 4:04 PM

Change 583295 merged by Ema:
[operations/puppet@production] ATS: disable transaction_active_timeout_in for EventStreams

https://gerrit.wikimedia.org/r/583295

That was it, I've started a GET request for recentchanges more than 20 minutes ago and it's still up. @stjn: could you please check if your bot now works as expected? Thanks!

stjn closed this task as Resolved.Mar 26 2020, 3:19 PM

Frequent disconnects stopped after 25th March, 15:30 UTC, so yes. Thank you for resolving this!

ema reopened this task as Open.Mar 30 2020, 8:43 AM
ema added a subscriber: Vgutierrez.

The patch disabling transaction_active_timeout was reverted by @Vgutierrez (thanks!) as it caused issues over the weekend: T248736. This means that clients connected to EventStreams will again be disconnected every 15 minutes till we find an appropriate way to tackle the issue. Sorry @stjn!

stjn added a comment.EditedMar 30 2020, 9:40 PM

I was wondering why did this start coming up again :-) Alas, hope a solution can be figured up.

Change 593517 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: disabling transaction_active_timeout_in for test eventstreams

https://gerrit.wikimedia.org/r/593517

Change 595484 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/debs/trafficserver@master] Release 8.0.7-1wm3

https://gerrit.wikimedia.org/r/595484

Change 595484 merged by Vgutierrez:
[operations/debs/trafficserver@master] Release 8.0.7-1wm3

https://gerrit.wikimedia.org/r/595484

Mentioned in SAL (#wikimedia-operations) [2020-05-11T10:15:10Z] <vgutierrez> upload trafficserver 8.0.7-1wm3 to apt.wm.o (buster) - T242767 T249335

Mentioned in SAL (#wikimedia-operations) [2020-05-12T06:56:54Z] <vgutierrez> upload trafficserver 8.0.7-1wm4 to apt.wm.o (buster) - T242767 T249335

Hio, I see some upgrades happened, should they have fixed this?

Heh, just tried myself, I guess not; still got disconnected after 15 minutes.

Nuria moved this task from In Progress to Paused on the Analytics-Kanban board.Wed, Jun 3, 4:55 PM