
Provide performant query access to banner show/hide numbers
Closed, Resolved · Public · 4 Story Points

Description

Currently, the hide numbers are only available through a hive query (see https://phabricator.wikimedia.org/T90635#1063926), which takes *hours*. Part of the issue is that we're parsing the URL many times for each record.

Instead, we should parse this data once and store it in a (public) schema, with at most a 15-minute lag behind real time, instantaneous query access, and appropriate indexes.
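To illustrate the idea, parsing each URL once at ingestion time (instead of on every query) might look like the following sketch. The parameter names `banner`, `campaign`, `result`, and `reason` are illustrative assumptions, not the actual request schema:

```python
from urllib.parse import urlsplit, parse_qs

def parse_banner_request(url):
    """Parse a banner request URL once, extracting the fields we care
    about into a flat record suitable for an indexed table."""
    query = parse_qs(urlsplit(url).query)

    def first(name):
        # parse_qs returns lists; take the first value or None.
        return query.get(name, [None])[0]

    # Field names below are hypothetical placeholders.
    return {
        "banner": first("banner"),
        "campaign": first("campaign"),
        "result": first("result"),   # e.g. "show" or "hide"
        "reason": first("reason"),
    }

record = parse_banner_request(
    "https://example.org/wiki/Special:RecordImpression"
    "?banner=B1&campaign=C1&result=hide&reason=empty"
)
```

Done once per record at digest time, this turns the repeated URL parsing into a single pass, after which queries only touch indexed columns.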

Event Timeline

awight created this task. · Feb 24 2015, 10:34 PM
awight raised the priority of this task from to Needs Triage.
awight updated the task description.
awight added subscribers: awight, ellery, Jalexander and 3 others.
Restricted Application added a subscriber: Aklapper. · Feb 24 2015, 10:34 PM

My query took 4.3 days of CPU time, and an hour of wall time. And the results were wrong :)

@Ottomata mentioned that we could look at https://github.com/wikimedia/analytics-kafkatee

My preference is definitely to run some logic as part of the process that digests our weblogs: parse the URL into the components we care about, and even aggregate over time if possible.

atgo triaged this task as Medium priority. · Mar 3 2015, 12:33 AM
atgo set Security to None.

@kevinator, I hear you might be interested in this problem. We need to do something like digest web requests into a database that can be queried efficiently for the kinds of things we're looking for.

I know nothing about whether it's easy or feasible... but I'm imagining something like a cluster-side script that logs certain requests directly in Kafka. Would that make any sense? Would it get us what we need?

BTW, soon we hope to turn off Special:RecordImpression and use Special:BannerLoader plus another logging call instead. So that might be a good time to move on this, since in any case we'll have to move some infrastructure around. :)

We could make special varnishkafka instances that logged only certain requests to a topic. Not a bad idea. Do you know which of the cache clusters get the requests you are interested in? text? bits?

Alternatively, we could use kafkatee to process these requests into something you need. Or, we could set up a streaming realtime framework to do something fancier. A realtime framework is the right way to do this, but it would take a lot more research and time and thought. What is your time frame for this?

Nuria closed this task as Resolved. · Sep 26 2016, 4:00 PM
Nuria claimed this task.
Nuria added a subscriber: Nuria.

I think the issues on this ticket have been solved; let us know otherwise.

awight reopened this task as Open. · Sep 26 2016, 5:08 PM

Unfortunately, I don't know of any progress towards this goal. We still need to run the expensive Hive query, repeatedly parsing each URL's parameters.

Nuria added a comment. · Sep 27 2016, 7:13 PM

This can be solved in two ways:

  1. Process webrequest records every hour into a table that stores the data you are interested in. This is how we calculate pageviews and other metrics, and the API folks do this to retrieve similar metrics.
  2. Set up a streaming platform in which we can filter the incoming pageview stream for your results and publish those to a Kafka topic that you read from.
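Option 2 above could start as little more than a filter between two Kafka topics. A minimal sketch, where a plain iterable of dicts stands in for the real Kafka consumer/producer pair and the `uri_path` field name is an assumption, not something specified on this ticket:

```python
def banner_events(webrequest_stream):
    """Yield only banner-related records, so downstream consumers read
    a small topic instead of the full webrequest firehose."""
    for record in webrequest_stream:
        # Hypothetical match condition; the real filter would target
        # whatever URL pattern the banner requests actually use.
        if "Special:BannerLoader" in record.get("uri_path", ""):
            yield record

# Stand-in for a Kafka consumer; in production each dict would be a
# deserialized webrequest message.
stream = [
    {"uri_path": "/wiki/Special:BannerLoader", "dt": "2016-09-27T19:00:00"},
    {"uri_path": "/wiki/Main_Page", "dt": "2016-09-27T19:00:01"},
]
filtered = list(banner_events(stream))
```

In a real deployment the filtered records would be published to a dedicated topic rather than collected into a list.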

cc @Ottomata

ggellerman set the point value for this task to 4. · Sep 27 2016, 9:07 PM
ggellerman moved this task from Triage to Q1 20/21 on the Fundraising-Backlog board.
Milimetric added a subscriber: Milimetric. · Edited · Sep 28 2016, 10:06 AM

+1 to the streaming way. We have to get there eventually, and this task is already so old it seems like it can wait for a proper solution. But we could always implement both. The hourly dump would just be a simple Oozie job, no?

Nuria added a comment. · Sep 29 2016, 3:24 PM

Let's look at this one as another use case for streaming cc @Milimetric and @Ottomata

Nuria added a comment. · Mar 16 2017, 5:21 PM

Ping @awight @AndyRussG: I think our job on Pivot and banners takes care of this item; let us know otherwise.

Nuria closed this task as Resolved. · Jun 12 2017, 4:03 PM

If the newer banner dataset on Druid needs more fields, please file a task to that effect.

mmodell removed a subscriber: awight. · Jun 22 2017, 9:33 PM