
Implement server side filtering for EventStreams (if we should)
Open, Medium, Public, 8 Estimated Story Points

Description

RCStream allows simple wildcard-based filtering on wiki database name. Since EventStreams topics can contain arbitrary (schemaed) JSON data, we should provide more generic filtering options (if we should do server side filtering at all). Kasocki had a couple of custom filtering options built in, but it was never decided whether we should use them.

One idea is to use @query/schema to do fancy query param -> JSON Schema based filtering. However, many Node.js JSON Schema packages speed up validation by using code generation, and we probably shouldn't generate code based on user input. :/
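A rough sketch of what that could look like, purely hypothetical, assuming filter fields arrive as query params and are validated with a non-code-generating library like jsonschema:

var Validator = require('jsonschema').Validator;

// Translate e.g. ?wiki=enwiki&type=edit into a JSON Schema that each
// event must match before it is sent down to the client.
function buildFilterSchema(queryParams) {
  var properties = {};
  Object.keys(queryParams).forEach(function (field) {
    properties[field] = { enum: [queryParams[field]] };
  });
  return { type: 'object', properties: properties, required: Object.keys(properties) };
}

function matchesFilter(event, queryParams) {
  return new Validator().validate(event, buildFilterSchema(queryParams)).valid;
}

// matchesFilter({wiki: 'enwiki', type: 'edit'}, {wiki: 'enwiki'}) => true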

This ticket can be used to decide whether we should implement server side filtering at all, and if so, how.

Event Timeline

Ottomata renamed this task from Implement filtering (if we should) to Implement server side filtering (if we should). Dec 8 2016, 9:49 PM

Change 326040 had a related patch set uploaded (by Ottomata):
[PoC] Use @query/schema and ajv for server side filtering

https://gerrit.wikimedia.org/r/326040

However, many Node.js JSON Schema packages speed up validation by using code generation, and we probably shouldn't generate code based on user input. :/

Although libs like ajv escape the input schema, it sounds like a very scary idea to me: they never claim it's safe to generate code from a user-provided schema. It seems like we would open a huge attack surface if we're building AJV functions from user input.

I think it would be helpful to start by considering the use cases we actually need to support.

So far, the only obvious case I am aware of where a current Kafka topic does not map directly to a public feed is per-wiki feeds. This could internally be implemented as a filter on an existing topic, or as a separate per-project topic that is filtered in something like ChangeProp.

In any case, I would recommend separating API concerns from implementation concerns. Depending on whether you want to allow arbitrarily complex matches, filtering might just be an internal implementation technique, or it might become part of the public API. My personal preference would be to keep it hidden unless absolutely needed.

Regarding the public API: In the case of per-wiki feeds, I personally think they would be most consistent and discoverable if we exposed them as part of the normal project API, for example at /api/rest_v1/stream/edits.
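For illustration, consuming such a per-wiki stream could then be as simple as (URL hypothetical):

var source = new EventSource('https://en.wikipedia.org/api/rest_v1/stream/edits');
source.onmessage = function (e) {
  console.log('enwiki edit', JSON.parse(e.data));
};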

Sorry if this has come up before, but SSE actually does allow for event type specification. Currently you send your events like so:

event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":142461965},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"event": "data", "is": "here"}

People can then subscribe to generic events like so:

source.addEventListener('message', function(e) {
  var data = JSON.parse(e.data);
  console.log('Generic event', data);
}, false);

However, if you "namespace" your events (for example use enedit to describe an edit to en.wikipedia), people can subscribe to just those events they are interested in:

event: enedit
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":142461965},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"event": "data", "is": "here"}

People can then subscribe to just enedit events like so:

source.addEventListener('enedit', function(e) {
  var data = JSON.parse(e.data);
  console.log('enedit event', data);
}, false);

The big decision there would be whether to double-send each edit event, as both a generic event and a "namespaced" event (so people can have catch-all subscriptions and opt-in subscriptions, which is what I did in http://wikipedia-edits.herokuapp.com/sse), or to send each event only once as a "namespaced" event, so that people interested in catching all events would have to subscribe to all namespaces.
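To illustrate, double-sending would put each edit on the wire twice, once under each event name:

event: message
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":142461965},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"event": "data", "is": "here"}

event: enedit
id: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":142461965},{"topic":"codfw.mediawiki.recentchange","partition":0,"offset":-1}]
data: {"event": "data", "is": "here"}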

Cheers,
Tom

PS: you should probably simplify the event IDs; I don't think it makes sense to have a JSON object as an event ID. Maybe just use the edit event's ID that is already part of the data payload?

Thanks for the response!

event: enedit

This doesn't actually do server side filtering, it just allows you to register a local event handler to fire on specific types of events. The server is still sending all messages to the client. Also, because we intend to expose arbitrary events, there is no guarantee that all events will have the concept of a 'wiki'. We'd need custom server side event field -> event namespace mapping configuration. But, most events will have a wiki project associated with them, so maybe it does make sense to do this.
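Such a mapping might look something like this (purely illustrative; field names taken from existing schemas where they exist):

// Hypothetical per-stream config: which event field supplies the SSE
// event name, for streams that have a wiki-like field at all.
var eventNameField = {
  'mediawiki.recentchange': 'wiki',        // -> event: enwiki
  'mediawiki.revision-create': 'database'
};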

I don't think it makes sense to have a JSON object as an event ID.

We need a complex object in order to do auto resuming from the past when a client disconnects. Each stream is made up of multiple Kafka topic partitions, each of which has its own offset. See EventStreams#Format and also KafkaSSE#notes-on-kafka-consumer-state.
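Sketched out, the resume flow works roughly like this (behavior as described in the KafkaSSE notes linked above):

// EventSource automatically echoes the last seen 'id' back on reconnect:
var source = new EventSource('https://stream.wikimedia.org/v2/stream/recentchange');
// On disconnect, the browser retries with a header like
//   Last-Event-ID: [{"topic":"eqiad.mediawiki.recentchange","partition":0,"offset":142461965}, ...]
// and the server seeks each topic-partition back to its stored offset,
// so no events are skipped (as long as Kafka still retains them).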

I'm not sure, it seems like we open a huge attack surface if we're building AJV functions..

Agree. A quick look over https://github.com/tdegrunt/jsonschema suggests there's no fancy code generation. Maybe safe enough?

Hm... I think even without code generation we still need to disallow some JSON Schema features, for example regexes, since they could be used for a DoS attack.
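For example (illustrative), a user-supplied schema pattern with catastrophic backtracking could pin the Node event loop:

// Looks harmless in a schema's "pattern" keyword...
var pattern = new RegExp('^(a+)+$');
// ...but takes exponential time on a crafted non-matching input:
pattern.test('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!');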

Yeah, agree. We could fork it and just manually disable the stuff we don't want to support, I suppose.

I'd definitely like to have per-wiki filtering for events. More granular would be nice, but not a must. Something like enedit may work, but it'd be nice to have a way to get all (or some wide set of) events from enwiki: not just edits, but also e.g. deletes, moves, etc.

We need a complex object in order to do auto resuming from the past when a client disconnects.

This is an interesting point for me. Does it mean I could "reconnect" to the stream at any point? Would it be possible to make it so I could choose that point, e.g. by timestamp or some other such external parameter?

I really think we should consider exposing separate streams, one per wiki, ideally (if we can make the technical stuff work), as part of each wiki's REST API. From the user perspective, this is very easy to discover and consume. It also keeps progress tracking simple, as we don't need to expose any implementation details like filter state. Not exposing such details also lets us optimize things under the hood, for example by pre-filtering those streams into separate Kafka topics.

Does it mean I could "reconnect" to the stream at any point? Would it be possible to make it so I could choose that point, e.g. by timestamp or some other such external parameter?

Timestamp, not yet. You can resume from the past, but you'd have to know the offset (position) in the topic, and the data would need to still be in Kafka, as we only keep the last 7 days.

Newer versions of Kafka support timestamp-based consumption, and we'll be upgrading in the next quarter or two. After we have that in place, we'll consider supporting timestamps in EventStreams.
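Once that lands, consumption might look something like this (parameter name hypothetical):

var source = new EventSource(
  'https://stream.wikimedia.org/v2/stream/recentchange?since=2017-06-01T00:00:00Z'
);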

we should consider exposing separate streams, one per wiki, ideally (if we can make the technical stuff work), as part of each wiki's REST API.

Not opposed, especially if this is just a piped API proxy to EventStreams. EventStreams could still support arbitrary-field server side filtering, but the REST API would just fill in the proper filter parameter for the wiki being used.

I found that EventStreams is fast enough that the client is able to consume all events (I ran a long-running test with 3 parallel streams), whereas with the previous socket.io-based stream the buffer overflowed in a very short time and couldn't keep up with more than one larger site. This means a server side filter isn't as important for EventStreams, though it was necessary for RCStream.

Change 326040 abandoned by Ottomata:
[PoC] Use @query/schema and ajv for server side filtering

https://gerrit.wikimedia.org/r/326040

I can see advantages of having EventStreams for a specific wiki. There are dozens if not hundreds of edits taking place across the sister projects at any moment, and in some cases stream subscribers just don't need events for all wikis. Right now, this leads to unnecessary data being fetched, parsed, and compared, much of which is not even needed. I really hope we reconsider this and provide an API for it, to avoid unnecessary computation.

Aklapper renamed this task from Implement server side filtering (if we should) to Implement server side filtering for EventStreams (if we should). Feb 19 2021, 10:36 AM
Aklapper removed Ottomata as the assignee of this task.
Aklapper removed a subscriber: GWicke.

@Acagastya: If there are actual new technical arguments, anyone is free to raise them.

@Aklapper Should I update the task description to better explain the current arguments?

No, comments picking up arguments from previous comments are fine.