EventStreams
Closed, Resolved | Public | 0 Estimated Story Points

Description

(This is a parent/placeholder ticket for Q4 goals linking.)

https://meta.wikimedia.org/wiki/Research:MediaWiki_events:_a_generalized_public_event_datasource is a proposal that expands on the feature set currently available from RCStream. We would like to generalize this beyond just mediawiki events, and build a service that can make arbitrary event streams of JSON events available for public consumption.

A brainstorm meeting about this was held on March 15 2016. Notes from the meeting are here: https://etherpad.wikimedia.org/p/PublicEventBus

Tentative Plan:

  • Build a service that exposes configured Kafka topics via websockets or http. Offset/timestamp historical consumption and field filtering TBD. This should at least be feature compatible with current RCStream (e.g. wiki filtering).
  • Expose public events currently available in Kafka via this service.
  • Produce recent changes events to Kafka (possibly via EventBus service, but maybe not).
  • Serve recentchanges events from this service.
  • deprecate RCStream python/redis based service
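As a rough illustration of the wiki filtering that RCStream feature compatibility implies, the sketch below filters a stream of JSON event lines by wiki. The `wiki` field name, the sample events, and the `filter_by_wiki` helper are assumptions made for illustration only; filtering in the actual service was still TBD at this point.

```python
import json
from typing import Iterable, Iterator


def filter_by_wiki(lines: Iterable[str], wikis: set) -> Iterator[dict]:
    """Yield decoded events whose (assumed) 'wiki' field is in `wikis`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore malformed lines rather than abort the stream
        if isinstance(event, dict) and event.get("wiki") in wikis:
            yield event


# Hypothetical sample input: one JSON object per line.
sample = [
    '{"wiki": "enwiki", "title": "Foo"}',
    '{"wiki": "svwiki", "title": "Bar"}',
    'not json',
    '{"wiki": "enwiki", "title": "Baz"}',
]
titles = [e["title"] for e in filter_by_wiki(sample, {"enwiki"})]
```

Whether such filtering would run server-side (as in RCStream) or in the consumer is a design choice the plan leaves open; the function works the same either way.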

Event Timeline

Why the latter? If we exclude for a moment browser-based solutions from the discussion, what would be the benefits in using the latter solution over SSE?

SSE is nice in browsers and if you are coding in Javascript and have the EventSource module. But, it'd be really nice if you could just do

curl http://.../stream/{topic-assigment}

and get blasted the JSON objects on the CLI, that you could then pipe into whatever you want.

SSE requires something like EventSource to parse the events.

I see a lot of browser-based discussion here. I'd just like to contribute my voice as an existing consumer of the existing RC Stream and hopeful consumer of whatever follows.

Choosing a protocol that has broad library support (Java in my case, but also Python etc.), in addition to JavaScript, is important to making a service that's useful to a wide audience.

@Afandian thanks for chiming in. Could you comment on what is easier for you in Java / Python? socket.io vs a more simple streamed HTTP response body, possibly with SSE/EventSource format? A quick google search shows some EventSource libraries for python, but I don't know how good they are.

@Ottomata I'm happy enough using the current Socket.io 0.9 client for Java (even if it is deprecated!). I don't have any complaints about the current situation beyond the protocol being a little long in the tooth. My preference would be something with supported libraries available so I don't have to worry too much about maintenance. If you just upgraded to a supported version of Socket.io I'd be happy.

I'm generally a bit wary of technology that's being incubated in the browser community because there's so much velocity that features can come and go, and that introduces the risk of technical debt. It's easier to update client-side browser code than live server-side infrastructure code. Case in point Socket.io version 0.9 was probably the best choice at the time, now it's unsupported.

I don't have a specific response to technology choices, but I'll go away and look closely at SSE vs socket.io.

I would suggest that acceptance criteria for the feature should include simple demo consumers for a number of popular languages. I'm happy to contribute to this effort.

(If I had a feature request it would be about catch-up, and that's probably orthogonal to this discussion).

(Funnily enough I'm planning to provide a similar service and considering a very vanilla HTTP pub-sub callback-based model. Not cutting-edge, and doesn't meet the same browser use cases, but it's simple.)

On the Java SSE client front, I see these two options:

The former looks lighter weight & more stand-alone, while the latter probably makes sense when using jersey anyway. I'm curious about your impression of these libraries.

Python:

Shell (no retry): curl <feed> | egrep '^data:' | sed s/^data:// | while read line;do ...; done
Others: https://en.wikipedia.org/wiki/Server-sent_events#Libraries
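For languages without a ready-made EventSource library, the wire format is small enough to parse by hand. The sketch below is a minimal Python equivalent of the shell pipeline above: it takes any iterable of text lines (for example, lines read from a streamed HTTP response) and groups them into events. The `parse_sse` name and the sample lines are made up for illustration, and the sketch ignores the `retry` field.

```python
from typing import Iterable, Iterator, Optional


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event_type = "message"
    data = []
    last_id: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event; events without data are dropped.
            if data:
                yield {"event": event_type, "id": last_id, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the format strips one leading space
        if field == "data":
            data.append(value)
        elif field == "event":
            event_type = value
        elif field == "id":
            last_id = value  # the id persists until a later event replaces it
        # other fields (e.g. "retry") are ignored in this sketch


# Hypothetical sample stream.
sample = [
    ": keep-alive\n",
    "id: 1\n",
    'data: {"a": 1}\n',
    "\n",
    "event: error\n",
    "data: first\n",
    "data: second\n",
    "\n",
]
events = list(parse_sse(sample))
```

Unlike the `egrep`/`sed` one-liner, this keeps the `id` and `event` fields, which matter for resuming and for telling error events from data.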

Ok, just found something out about SSE that makes me personally vote against it. Not that my vote matters a lot, just letting you know :)

So the thing is, I thought SSE was a decent idea because it had this auto-resume capability. But it turns out the auto-resume kicks in only if the server restarts. The client, sensing the restart, can resume fetching from the id it was last at. But if the client restarts, which is by far the more common case, the client has to store the id somewhere anyway. Well, if the client is storing that id anyway, then it doesn't care about the server restart case. On any problem, server or client, it can just wait and resume from the last stored id.
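To make the "client has to store the id somewhere anyway" point concrete, here is a minimal sketch of a client-side store that survives restarts. The `LastEventIdStore` class, the JSON file layout, and the file name are all hypothetical; only the `Last-Event-ID` header name comes from SSE itself.

```python
import json
import os
import tempfile
from typing import Optional


class LastEventIdStore:
    """Persist the most recently processed event id in a small JSON file."""

    def __init__(self, path: str) -> None:
        self.path = path

    def load(self) -> Optional[str]:
        try:
            with open(self.path, "r", encoding="utf-8") as f:
                return json.load(f).get("last_event_id")
        except (FileNotFoundError, json.JSONDecodeError):
            return None  # first run, or an unreadable file: no resume point

    def save(self, event_id: str) -> None:
        # Write to a temp file and rename, so a crash cannot leave a half-written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            json.dump({"last_event_id": event_id}, f)
        os.replace(tmp, self.path)

    def request_headers(self) -> dict:
        """Headers to send on the next connection attempt."""
        last = self.load()
        return {"Last-Event-ID": last} if last is not None else {}


# Simulate two client runs that share one state file.
state_file = os.path.join(tempfile.mkdtemp(), "offset.json")
first_run = LastEventIdStore(state_file)
headers_before = first_run.request_headers()
first_run.save("42")
second_run = LastEventIdStore(state_file)
headers_after = second_run.request_headers()
```

The same store would be needed with socket.io or polling, which is the argument being made here: the persistence logic does not depend on the transport.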

Therefore, to me, SSE is pointless. It has no real advantage to socket.io or even simple polling.

Skimming this discussion makes me think IRC isn't so bad.

The task description currently says "deprecate RCStream python/redis based service." Is this still part of the tentative plan? I thought RCStream would (someday) replace irc.wikimedia.org.

Sorry, chiming in quite late based on Andrew Otto's ping 13 days ago via email…

Server-Sent Events (SSE) is broadly supported on desktop and mobile, with the infamous exception being Microsoft browsers. It is easy to consume on the client (trivial to parse, auto-reconnect, simple filtering based on event types, …) and easy to produce on the server. It is unidirectional, but that doesn't matter in the concrete context. It can be consumed via cURL for quick sampling. SSE seems to not scale too well with many messages, check, for example, my Twitter over SSE example, the browser starts to go down on its knees.

Web Sockets is the more broadly supported standard and bidirectional. If going through the 1.0 version of the protocol, no additional libraries (that means Socket.IO in practice) are needed on the client (but requires going through Socket.IO with the current Wikimedia implementation). It has broad support on the server and is easy to create. I haven't done extensive tests with Web Sockets stand-alone, but it just seems to work fine with Socket.IO, which added a compatibility layer when browser support wasn't great.

I have played with both, see Bots vs. Wikipedians for SSE and Wikipedia Screensaver for Web Sockets with Socket.IO. In the long-term, Web Sockets seems like the more promising method that has gained more general Web adoption, and given the protocol is upgraded to version 1.0 on the Wikimedia side (see my rant on slide 7 of my Wiki Workshop deck), I am fine with it. Personally, I would have loved SSE to gain more adoption for unidirectional use cases for its simplicity, but well…

This comment was removed by Tomayac.

(Sorry, I am on a flaky train Wi-Fi, it seems my comment got double-posted despite one failure message, removed one of the two).

Thanks very much, @Tomayac, appreciate the analysis.

Much of the discussion here seems to directly compare WebSockets vs. SSE. I don't think this makes much sense. WebSockets is providing low-level bidirectional communication and framing, very much the way HTTP does, especially in HTTP/2. SSE adds a simple streaming format & minimal retry protocol on top of this. There is no equivalent standard protocol on top of WebSockets that I am aware of, so we are basically discussing SSE vs. a custom protocol.

Therefore, to me, SSE is pointless. It has no real advantage to socket.io or even simple polling.

SSE does define a standard protocol for client retries, and uses that mechanism automatically to recover from connection issues. Standard clients are available across basically all platforms, and are built into major browsers. You are right that clients that need to resume across client runs need to store the offset somewhere, but this is true with pretty much any protocol that doesn't remember per-client offsets server-side. With WebSockets, we will have to define a completely custom protocol equivalent to SSE. Users will either have to implement this themselves (more complex clients), or we'll have to maintain client implementations ourselves. In any case, browser use cases will have to download a client implementation, while with SSE most users will already have that implementation built in.
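The retry behaviour described above can be sketched in a few lines. This is not a real SSE client: `connect` is a hypothetical callable standing in for the HTTP request (it receives the last seen id, which a real client would send as `Last-Event-ID`), and the fixed `retry_seconds` stands in for the server-provided `retry:` value.

```python
import time
from typing import Callable, Iterable, Optional


def consume_with_resume(
    connect: Callable[[Optional[str]], Iterable[dict]],
    handle: Callable[[dict], None],
    max_attempts: int = 3,
    retry_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Read events, reconnecting from the last seen id after a dropped connection."""
    last_id: Optional[str] = None
    for _ in range(max_attempts):
        try:
            for event in connect(last_id):
                handle(event)
                last_id = event.get("id", last_id)
            return last_id  # stream ended cleanly
        except ConnectionError:
            sleep(retry_seconds)  # wait, then reconnect with the last id
    return last_id


calls = []  # records the id passed on each connection attempt
seen = []   # records the data of each handled event


def fake_connect(last_id):
    """Hypothetical stand-in for the HTTP request; drops the connection once."""
    calls.append(last_id)
    start = 0 if last_id is None else int(last_id) + 1
    for i in range(start, 5):
        if i == 2 and len(calls) == 1:
            raise ConnectionError("simulated drop")
        yield {"id": str(i), "data": f"event {i}"}


final_id = consume_with_resume(
    fake_connect, lambda e: seen.append(e["data"]), sleep=lambda s: None
)
```

With a WebSocket transport, the equivalent of the `last_id` handshake would have to be specified and implemented as part of a custom protocol.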

Performance-wise, the biggest difference between WebSockets & HTTP is probably about connection sharing. SSE will benefit from regular HTTP/2 connection sharing between tabs, while WebSockets will open new sockets for each tab & WebSocket connection. This means that there will be more TLS contexts to set up & manage server side when using WebSockets. Client side, maintaining more connections can lead to higher power usage.

In terms of adoption, I think it's fair to say that HTTP has broader support than WebSockets, and SSE has more market share than a custom protocol on top of websockets. Most large-scale public streaming implementations like Twitter or FaceBook / Google chat are using plain HTTP (streaming or long polling), and do not use WebSockets.

With HTTP, we can expose these feeds as part of wider HTTP / REST APIs. Documentation benefits from standard tooling like Swagger. A custom protocol on top of WebSockets does not fit into REST APIs, and can't readily leverage existing documentation tools.

Overall, I honestly see few reasons to use WebSockets & a custom protocol rather than HTTP & a signaling protocol like SSE for unidirectional stream use cases.

Even for bidirectional communications I wouldn't be surprised if WebSockets lost out to HTTP/2 and WebRTC longer term. HTTP/2 and WebSockets are very close in feature set and performance, and WebRTC is adding UDP support for really low latency use cases like games. See this article describing the lack of interest in supporting WebSockets in HTTP/2.

In any case, browser use cases will have to download a client implementation, while with SSE most users will already have that implementation built in.

I think Dan is saying that even with SSE, browser use cases will want to download a client implementation. SSE will only auto resume on disconnects, not reloads. So, if someone wants their app to resume when someone comes back to a page later, they will need some kind of offset storage, and code to load the last-event-id header properly, before initiating the SSE request. Dan is arguing that proper auto-resume functionality requires a custom client for SSE and for websockets, so last-event-id isn't much of an advantage for SSE.

I kinda agree, but I'd like to point out that it is a small advantage. I can see many browser-based use cases that are just for nice visualizations, e.g. the 'listen to wikipedia' thing that was on display in the office. If the browser temporarily loses the connection to the server, the auto-resume feature of SSE is nice to have here.

For more complex use cases, like mobile apps, or ORES type stuff, folks will be working in the language of their choice, not in a browser and will want to store offsets, so a custom client will be needed.

@Tomayac, thanks for the response. Q:

SSE seems to not scale too well with many messages, check, for example, my Twitter over SSE example, the browser starts to go down on its knees.

I see around 60 msgs/sec there, which isn't much. Correct me if I'm wrong, but it seems that a browser would probably start slowing if this was done with websockets too. The link there requests the SSE endpoint directly, so the browser is just rendering as much text as it can sequentially. A page that processed events into a display of some kind (e.g. this), rather than just dumping it all into text would probably do fine over SSE, even with more messages, no?

I see around 60 msgs/sec there, which isn't much.

It's actually a lot more, a Twitter gardenhose actually. The browser simply doesn't show them all and gives up. If you just listen to the events and don't do anything with the results (apart from counting), you get a lot more, but the browser really becomes unresponsive after a while.

Hm, EventSource kinda sucks?

I'd like to send a proper HTTP error response status if something is wrong before I start sending the chunked body. Say the client is trying to read from topics that aren't available, or sends a badly formatted Last-Event-ID header. I can do this easily on the server side by just setting res.statusCode and closing the response before I initialize the SSE connection. Fine, this works for an HTTP request made directly to the SSE endpoint.

BUT! EventSource, which is built into most browsers, is pretty dumb about this. The spec simply says they should fire a simple event named error. This is what both a node implementation and webkit do.

This is not helpful! I'd like to be able to return a 404 if the client requests a stream for a topic that does not exist. If that client is using EventSource as expected, there is no way for me to do this.

The only alternative I have is to always rely on SSE/EventSource error events. That is, I will have to start the chunked body ASAP, which will start sending a 200 response immediately. I can then emit a custom error event to the EventSource client if they are asking for a topic that doesn't exist, but this doesn't feel very HTTP/REST like. If a resource doesn't exist, I should be able to 404, no?
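The two options can be put side by side in a small sketch. It is written in Python rather than the node code under discussion, and the `respond` helper, the `topic_error` event name, and the topic names are all hypothetical; only the `event:` / `data:` framing is the standard SSE wire format.

```python
from typing import Iterable, Optional, Tuple


def format_sse(data: str, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one event in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.extend(f"data: {part}" for part in data.split("\n"))
    return "\n".join(lines) + "\n\n"


def respond(requested: Iterable[str], available: set, error_in_stream: bool) -> Tuple[int, str]:
    """Return (HTTP status, start of body) for a stream request.

    error_in_stream=False: reject with 404 before any body is sent.
    error_in_stream=True: always answer 200 and report the problem as an SSE event.
    """
    missing = sorted(set(requested) - available)
    if not missing:
        return 200, ""
    message = "unknown topics: " + ", ".join(missing)
    if error_in_stream:
        return 200, format_sse(message, event="topic_error")
    return 404, message


available = {"recentchange"}  # hypothetical topic name
plain = respond(["nope"], available, error_in_stream=False)
in_stream = respond(["nope"], available, error_in_stream=True)
ok = respond(["recentchange"], available, error_in_stream=False)
```

The first form is what a curl user or REST tooling expects; the second is the only one an EventSource client can actually inspect, since it sees a generic error for any non-200 response.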

@GWicke, thoughts?

We discussed error handling on IRC, and found that HTTP status codes / headers for SSE requests are reported in the dev console in Firefox and Chrome >= 54, but not in current stable Chrome (53). That is about to be released in a couple of days (Oct 18th), so this should become more usable soon.

Error handling features in non-browser clients vary a bit. The tornado python client is logging errors, while the node eventsource library is passing the status code in case of connection setup errors. We could consider contributing a separate event ("connectError" or the like) for better handling of connection errors.

Thanks for the feedback everyone. At this time, we are moving forward with SSE. We can always revisit possible websocket support later if the HTTP based streams service causes trouble.

Ottomata renamed this task from Public Event Streams to Bikeshed what events should be exposed in public EventStreams API. Nov 1 2016, 7:47 PM
Ottomata updated the task description.
Ottomata renamed this task from Bikeshed what events should be exposed in public EventStreams API to Public Event Streams. Nov 1 2016, 7:49 PM
Ottomata updated the task description.

Whoops, meant to create a subtask, not edit this one. Reverted.

At this time, we are moving forward with SSE. We can always revisit possible websocket support later if the HTTP based streams service causes trouble.

What, if anything, does this mean for irc.wikimedia.org and stream.wikimedia.org, both short-term and long-term?

From some internal discussions, it seems likely that irc.wikimedia.org will remain as is. We may rework the backend, but the IRC functionality will probably stay the same, as there are so many tools already built on this. Deprecating it would be difficult.

stream.wikimedia.org has needed a revamp for a while. It is built using an outdated socket.io version, and it isn't always easy to find a compatible client version. RCStream (the stream.wikimedia.org backend) was originally going to be rewritten in nodejs, and use a newer socket.io. Since Public EventStreams exposes similar functionality, the current plan is to deprecate stream.wikimedia.org once Public EventStreams is deployed and reaches some feature parity. We'll make sure to have a well documented timeline for this once it makes sense to start thinking about it, probably next quarter.

Ottomata renamed this task from Public Event Streams to EventStreams. Feb 6 2017, 7:27 PM

Hi. I'm working on https://sv.wikipedia.org/wiki/MediaWiki:Gadget-EventStreams.js

I note that parsedcomment is not present. Is that absent by accident, or intentionally excluded for performance reasons?

Yes, comment is available. parsedcomment is not.

@Nirmos: was it available in the old RCFeed? I do not see it on the comments, but maybe it was there under a different name: https://www.mediawiki.org/wiki/Manual:RCFeed

Can we ask you as to the usage of your gadget? Maybe better on #wikimedia-analytics

was it available in the old RCFeed?

I don't know, but it's available in the recentchanges API.

Can we ask you as to the usage of your gadget?

Not sure what you mean. I'm building a gadget that shows an RC feed on https://sv.wikipedia.org/wiki/Wikipedia:Senaste_%C3%A4ndringar/EventStreams so that patrollers don't have to reload Special:RecentChanges all the time. Having parsedcomment would be a big plus, but if it's impossible I'll just accept that.

Hi @Nirmos, I created T170145 to talk about this more.

FYI, I just turned off RCStream (routes)!

Tags aren't available either?

I think it's time to close this task :)