
Prototype counting of requests with real time (streaming data)
Closed, DeclinedPublic

Description

Problem

Currently we search through the entire webrequest or webrequest_raw dataset for something that matches condition X, then count it or save it somewhere for additional processing. Then, if we want to search for something that matches condition Y, we do the same thing, looking at every record in these very large datasets. As we do this for many conditions, we waste a lot of cluster resources on just accessing the data repeatedly.

Proposal

Instead of searching for X and Y separately, let's make a system that knows all the X, Y, Z, etc. conditions that we're looking for and what to do once we find a match. Then for each record, we evaluate all the conditions at once and therefore minimize how often we touch the data. Since this is basically what streaming is built for, we want to implement it on top of the streaming data.
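The single-pass idea above can be sketched in a few lines. This is a hypothetical illustration, not part of the ticket: the condition names, the record fields, and the `count_matches` helper are all invented for the example.

```python
# Hypothetical sketch: evaluate every condition against each record in one
# pass, instead of re-scanning the dataset once per condition.
from collections import Counter

# Invented example conditions; real ones would come from the use-case list.
conditions = {
    "is_pageview": lambda r: r.get("is_pageview", False),
    "from_india": lambda r: r.get("country") == "IN",
    "api_request": lambda r: r.get("uri_path", "").startswith("/w/api.php"),
}

def count_matches(records, conditions):
    """Touch each record exactly once, checking all conditions against it."""
    counts = Counter()
    for record in records:
        for name, predicate in conditions.items():
            if predicate(record):
                counts[name] += 1
    return counts
```

With N conditions this reads the data once instead of N times, which is the resource saving the proposal is after.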

Plan

The whole set of conditions (to be listed here in the description by me in a moment) may include searching based on raw, refined, or aggregated data. At each stage we lose some ability to filter and gain some efficiency from already-computed fields like geolocation, pageview_info, etc., so this system would have to provide access to the data at each stage. Since for now we only have raw data streaming out of Kafka, we will start by prototyping a tool to handle use cases of searching through this raw data. Once we add stream refining and stream aggregations, we should be able to re-use the same system and just subscribe to different topics. Therefore, this prototype should make it easy to write some simple configuration and filter logic that will then:

  • subscribe to a kafka topic
  • apply a filter implemented in [scala, python, java, js, anything?]
  • with any matching records, do any of:
    • produce to a new topic
    • count and send stats to grafana by [some granularity? hour/day...]
    • do custom aggregation [is this supported by use cases? if not we can strike from initial]
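The configuration described by the bullets above might look something like the sketch below. Everything here is assumed for illustration: the topic names, the in-memory `topics` stand-in (a real version would use a Kafka consumer/producer client), and the subscription schema.

```python
# Hypothetical sketch of the proposed config: each subscription names a
# topic, a filter, and an action to run on matching records.
from collections import defaultdict

# In-memory stand-in for Kafka topics, so the sketch is self-contained.
topics = defaultdict(list)

def produce(topic, record):
    """Append a record to a topic (stands in for a Kafka producer)."""
    topics[topic].append(record)

# Invented example: route 404s from the raw stream to their own topic.
subscriptions = [
    {
        "topic": "webrequest_raw",
        "filter": lambda r: r.get("http_status") == "404",
        "action": lambda r: produce("webrequest_404", r),
    },
]

def run_once(subscriptions):
    """Deliver each record on a subscribed topic to every matching action."""
    for sub in subscriptions:
        for record in topics[sub["topic"]]:
            if sub["filter"](record):
                sub["action"](record)
```

Because the subscription only names a topic, the same logic would later work unchanged against refined or aggregated streams by pointing it at a different topic.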

Action Items

  • Make sure the list of use cases below includes everything we know of

Event Timeline

> • produce to a new topic
> • count and send stats to grafana by [some granularity? hour/day...]

These two can probably be made generic, but it would be important to be able to provide a custom function as well, not just custom aggregations.
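One way to read the comment above: the built-in actions are just factory functions, and a custom action is any plain callable with the same shape. The `count_sink` helper, the granularity widths, and the `dt` field are all assumptions made for this sketch.

```python
# Sketch: a generic, reusable "count by granularity" sink alongside an
# arbitrary custom function. Field names and helpers are hypothetical.
from collections import Counter

counts = Counter()

def count_sink(granularity):
    """Generic sink: bucket matches by a timestamp prefix.

    Assumes ISO timestamps, e.g. 'YYYY-MM-DDTHH' for hour granularity.
    """
    width = {"hour": 13, "day": 10}[granularity]
    def sink(record):
        counts[record["dt"][:width]] += 1
    return sink

# ...but any plain callable works as a custom action too:
seen_uris = []
def custom_sink(record):
    seen_uris.append(record.get("uri_path"))
```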

Milimetric triaged this task as Medium priority. May 8 2017, 2:31 PM

> Currently we search through the entire webrequest or webrequest_raw dataset for something that matches condition X

The premise of this ticket is a bit dated: rather than looking through the whole firehose, we moved our thinking towards events a while back. That is not to say we should not filter streams, but rather that the webrequest stream should probably be used less and less.

From grooming: Closing this, as we have many other open prototypes.