Currently, to find records matching some condition X, we scan the entire webrequest or webrequest_raw dataset, then count the matches or save them somewhere for further processing. To search for records matching condition Y, we repeat the whole scan, again reading every record in these very large datasets. Doing this for many conditions wastes a lot of cluster resources just on reading the same data over and over.
Instead of searching for X and Y separately, let's build a system that knows all the conditions we're looking for (X, Y, Z, etc.) and what to do once a match is found. Then, for each record, we evaluate all the conditions at once, minimizing how often we touch the data. Since this is essentially what stream processing is built for, we want to implement it on top of the streaming data.
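As a rough illustration of the single-pass idea (names like `Condition` and `single_pass` are placeholders, not an existing API), each condition pairs a predicate with an action, and one scan dispatches to all of them:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Condition:
    """One registered search: a predicate plus what to do on a match."""
    name: str
    matches: Callable[[dict], bool]
    on_match: Callable[[dict], None]

def single_pass(records, conditions):
    """Evaluate every condition against each record in a single scan,
    instead of one full scan per condition."""
    for record in records:
        for cond in conditions:
            if cond.matches(record):
                cond.on_match(record)

# Example: count matches for two conditions X and Y in one pass.
counts = {"x": 0, "y": 0}

def count_into(key):
    def action(record):
        counts[key] += 1
    return action

conditions = [
    Condition("x", lambda r: r["status"] == 404, count_into("x")),
    Condition("y", lambda r: r["host"].endswith(".wikipedia.org"), count_into("y")),
]

records = [
    {"status": 404, "host": "en.wikipedia.org"},
    {"status": 200, "host": "example.com"},
]
single_pass(records, conditions)
```

The data is touched once regardless of how many conditions are registered; adding a condition adds only per-record predicate cost, not another full read of the dataset.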
The full set of conditions (to be listed here in the description shortly) may include searches over raw, refined, or aggregated data. At each stage we lose some filtering ability but gain efficiency from already-computed fields like geolocation, pageview_info, etc., so this system would have to provide access to the data at each stage. Since for now only raw data is streaming out of Kafka, we will start by prototyping a tool for use cases that search this raw data. Once we add stream refining and stream aggregations, we should be able to re-use the same system and just subscribe to different topics. The prototype should therefore make it easy to write some simple configuration and filter logic that will:
- subscribe to a kafka topic
- apply a filter implemented in [scala, python, java, js, anything?]
- with any matching records, do any of:
  - produce to a new topic
  - count and send stats to grafana by [some granularity? hour/day...]
  - do custom aggregation [is this supported by use cases? if not we can strike from initial]
- Make sure the list of use cases below includes everything we know of
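One possible shape for that configuration, sketched in Python (field names like `topic`, `filter`, and `actions` are hypothetical, not a settled schema; the Kafka consumer/producer plumbing is stubbed out):

```python
# Hypothetical subscription config: one entry per use case, each naming a
# topic, a filter predicate, and the actions to take on matching records.
subscriptions = [
    {
        "topic": "webrequest_raw",
        "filter": lambda record: record.get("http_status") == "404",
        "actions": [
            {"type": "produce", "target_topic": "webrequest_404"},
            {"type": "count", "granularity": "hour", "metric": "webrequest.404"},
        ],
    },
]

def handle(record, subscription, producer, counter):
    """Apply one subscription's filter and dispatch its actions.
    `producer` and `counter` stand in for the real Kafka producer and
    the stats reporter (e.g. whatever feeds grafana)."""
    if not subscription["filter"](record):
        return
    for action in subscription["actions"]:
        if action["type"] == "produce":
            producer(action["target_topic"], record)
        elif action["type"] == "count":
            counter(action["metric"], action["granularity"])

# Example dispatch: capture the side effects in a list instead of
# touching real Kafka or grafana.
dispatched = []
handle(
    {"http_status": "404", "uri": "/wiki/Foo"},
    subscriptions[0],
    producer=lambda topic, rec: dispatched.append(("produce", topic)),
    counter=lambda metric, gran: dispatched.append(("count", metric)),
)
```

Keeping the filter as a plain function and the actions as data would let the same loop serve refined or aggregated topics later by changing only the `topic` field.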