Prototype counting of requests with real time (streaming data)
Closed, Declined · Public

Assigned To

None

Authored By

	• Nuria
	Feb 28 2017, 7:55 PM

Description

Problem

Currently we search through the entire webrequest or webrequest_raw dataset for something that matches condition X and count it or save it somewhere for additional processing. Then, if we want to search for something that matches condition Y we do the same thing, looking at every record in these very large datasets. As we do this for many conditions, we waste a lot of cluster resources on just accessing data repeatedly.

Proposal

Instead of searching for X and Y separately, let's make a system that knows all the X, Y, Z, etc. conditions that we're looking for and what to do once we find a match. Then for each record, we evaluate all the conditions at once and therefore minimize how often we touch the data. Since this is basically what streaming is built for, we want to implement it on top of the streaming data.
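To make the single-pass idea concrete, here is a minimal sketch, assuming each record arrives as a dict. The condition names and the field names (`http_status`, `uri_host`, `uri_path`) are illustrative assumptions, not taken from this task.

```python
from collections import Counter

# Hypothetical conditions: name -> predicate over one record (a dict).
# The field names used here are assumptions for illustration only.
CONDITIONS = {
    "server_errors": lambda r: r.get("http_status", "").startswith("5"),
    "mobile_hosts": lambda r: ".m." in r.get("uri_host", ""),
    "api_requests": lambda r: r.get("uri_path", "").startswith("/w/api.php"),
}


def count_matches(records, conditions=CONDITIONS):
    """Evaluate every condition against each record in a single pass."""
    counts = Counter()
    for record in records:  # each record is read exactly once
        for name, predicate in conditions.items():
            if predicate(record):
                counts[name] += 1
    return counts


# Made-up sample records standing in for the stream.
SAMPLE = [
    {"http_status": "200", "uri_host": "en.m.wikipedia.org", "uri_path": "/wiki/Foo"},
    {"http_status": "503", "uri_host": "en.wikipedia.org", "uri_path": "/w/api.php"},
    {"http_status": "200", "uri_host": "de.wikipedia.org", "uri_path": "/wiki/Bar"},
]

if __name__ == "__main__":
    print(count_matches(SAMPLE))
```

Because `count_matches` accepts any iterable, the same loop would work on a stream that can only be consumed once. Adding a condition costs one more predicate call per record rather than one more scan of the dataset.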

Plan

The whole set of conditions (to be listed here in the description by me in a moment) may include searching based on raw, refined, or aggregated data. At each of these stages, we lose some ability to filter and gain some efficiency from already-computed fields like geolocation, pageview_info, etc. So this system would have to provide access to the data at each stage. Since for now we only have raw data streaming out of Kafka, we will start by prototyping a tool to handle use cases of searching through this raw data. Once we add stream refining and stream aggregations, we should be able to re-use the same system and just subscribe to different topics. Therefore, this prototype should make it easy to write some simple configuration and filter logic that will then:

- subscribe to a Kafka topic
- apply a filter implemented in [scala, python, java, js, anything?]
- with any matching records, do any of:
  - produce to a new topic
  - count and send stats to grafana by [some granularity? hour/day...]
  - do custom aggregation [is this supported by use cases? if not we can strike from initial]
Action Items

Make sure the list of use cases below includes everything we know of

Related Objects

Status	Assigned	Task
Resolved	Ottomata	T185233 Modern Event Platform
Declined	None	T159264 Prototype counting of requests with real time (streaming data)
Declined	None	T122245 REST API entry point web request statistics at the Varnish level

Event Timeline

• Nuria created this task. Feb 28 2017, 7:55 PM

Restricted Application added a subscriber: Aklapper. Feb 28 2017, 7:55 PM

• Nuria added a subtask: T122245: REST API entry point web request statistics at the Varnish level. Feb 28 2017, 7:56 PM

Milimetric updated the task description. Feb 28 2017, 8:38 PM

produce to a new topic

count and send stats to grafana by [some granularity? hour/day...]

These two can probably be made generic, but it would be important to be able to provide a custom function as well, not just custom aggregations.

• Nuria moved this task from Incoming to Wikistats on the Analytics board. Mar 2 2017, 5:08 PM

• Nuria moved this task from Wikistats to Dashiki on the Analytics board. Mar 13 2017, 5:09 PM

Milimetric triaged this task as Medium priority. May 8 2017, 2:31 PM

JAllemandou subscribed. Jun 27 2017, 8:10 AM

• Nuria moved this task from Dashiki to Wikistats on the Analytics board. Jul 3 2017, 4:55 PM

• Nuria added a parent task: T185233: Modern Event Platform. Feb 5 2018, 5:51 PM

• Nuria moved this task from Wikistats to Backlog (Later) on the Analytics board.

• Pchelolo closed subtask T122245: REST API entry point web request statistics at the Varnish level as Declined. Jul 31 2019, 10:44 PM

Currently we search through the entire webrequest or webrequest_raw dataset for something that matches condition X

The premise of this ticket is a bit old: rather than looking through the whole firehose, we moved our thinking towards events a while back. That is not to say we should not filter streams, but rather that the webrequest stream should probably be used less and less.

From grooming: Closing this, as we have many other open prototypes.

mforns closed this task as Declined. Aug 10 2020, 4:04 PM
