Page MenuHomePhabricator

Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset
Open, NormalPublic

Description

I was getting to grips with running queries on the Hive cluster via HUE.

To ground truth the queries I was running I was comparing with the druid data shown in turnilo.

Using filtering in turnilo for only a time period "Sept 14 2100-2200 UTC", campaign "wlm+2018" and country "GB" yielded total impressions of 15970

https://turnilo.wikimedia.org/#banner_activity_minutely/3/N4IgbglgzgrghgGwgLzgFwgewHYgFwhLYCmAtAMYAWcATmiADQgYC2xyOx+IAomuQHoAqgBUAwoxAAzCAjTEaUfAG1QaAJ4AHLgVZcmNYlO4B9E3sl6ASnGwBzYkryqQUNLXoEATAAYAjAAcpD4AnKR+ACwiXn54Pj5xPgB08T4AWpLE2AAm3L6BwWGR0V6JiSnxGQC+ALpVDGpaOq5oNBD2kobGBORwLJpwEHa4THDkGDjc7ZJgiDCOKiAA7ggsANT5ASA1TNiYnlKIUMT1jdrcbm0dBkbc5Jgw2K3qkmMTuATTTLMI807KIAA4gAhba7fb4Q4IY61JhQTRIND/M7NCxMbIQNjYKBYD4gMxokB2Gi2GAIWgQDTcAAKIj8AFlJFBMHR8KAuqZzJj9CAMYZxrjuHAoOQshiOvVCJjKfhsGSELCQGxhTBDE5QNB6WSMJCjsQ4e07AhmoYAI5/NAme6PehMFjaiCLM0Wq0PJ7bSWadokbIAEW52Nx/x2IC92B9AGUWZ4QM7HJbre6mMaHDlZfKmJQhpQkHZKJ45QgFUA===

What I thought would be the exact same query in HUE:

SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year='2018' AND month='9' AND day='14' AND hour='21') AND (event.campaign="wlm 2018") AND (event.country="GB")

Returns an impression count of 15348

I tried using the timestamp with the following query:
SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year="2018") AND (event.campaign="wlm 2018") AND (dt LIKE '2018-09-14T21%') AND (event.country="GB")

returns: 15348

I then tried

SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year="2018" AND month="9" AND day="14" AND hour="21") AND (event.campaign="wlm 2018") AND (geocoded_data["country_code"] = "GB")

returns 15325

SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year="2018" AND month="9" AND day="14" AND hour="21") AND (event.campaign="wlm 2018") AND (geocoded_data["country"] = "United Kingdom")

returns 15325

I cant check to see if there are missing entries in the year/month/day/hour columns since there is a predicate on the table forcing me to use one of those.

Event Timeline

Jseddon created this task.Sep 15 2018, 4:54 AM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptSep 15 2018, 4:54 AM

As expected this https://superset.wikimedia.org/r/14 query matches turnillo

Jseddon renamed this task from Discrepancy between HUE query in Count of event.impression and turnillo to Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset.Sep 15 2018, 5:10 AM
Jseddon added a project: Analytics.
Milimetric assigned this task to mforns.Sep 24 2018, 3:43 PM
Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric added a project: Analytics-Kanban.

Will look into this in the next couple days.

mforns moved this task from Next Up to In Progress on the Analytics-Kanban board.Oct 2 2018, 3:01 PM
Nuria added a subscriber: Nuria.Oct 18 2018, 5:54 PM

We have looked into whether the source of discrepancy could be unvalid events and there does not seem to be any schema validation errors for this schema for the months of 9, 10.

Nuria added a comment.EditedOct 18 2018, 6:00 PM

Other possible cause could be that not all impressions are being sent by your EL code in the client (note that EL counts are lower than the ones that come from parsing the whole firehouse)

Questions:

  • Are these banners shown to ALL browsers or only to the ones to whom we serve javascript?

Note EL code uses sendbeacon and an image pixel, for browsers that do not support send beacon the EL request will not be sent if the customer moves out of the page before synchronous ping happen.

Nuria reassigned this task from mforns to JAllemandou.Oct 18 2018, 6:01 PM
Nuria added a subscriber: mforns.

Ping @AndyRussG regarding the question: "Are these banners shown to ALL browsers or only to the ones to whom we serve javascript?"

More investigations on this.
As @Nuria explained above, we think that the highest probability of problem is the client not managing to send the eventlogging event, for instance because the browser doesn't have javascript.
I have double checked data on the cluster and it goes in that sense.
I queried turnilo, eventlogging and webrequest for yesterday (Oct 17th 2018) on campaign ITwikicon (small volume).
What I find is that turnilo == webrequest(/beacon/impression) (expected), and that eventlogging == webrequest(/beacon/event) (expected as well). And as suspected, turnillo == /beacon/impression is a little higher than eventlogging == /beacon/event.
Repro for turnilo: here
Repro for eventlogging and webrequest, on stat100[4|5]:

spark2-shell --master yarn

// EventLogging data
spark.sql("select hour, COUNT(1) as c from event.centralnoticeimpression where year=2018 and month = 10 and day = 17 and event.campaign = 'ITwikicon' group by hour order by hour").collect.foreach(println)

// Webrequest Beacon Event
spark.sql("SELECT hour, count(1) FROM wmf.webrequest where webrequest_source = 'text' and year=  2018 and month = 10 and day = 17 and uri_path = '/beacon/event' and uri_query like '%CentralNoticeImpression%' and uri_query like '%ITwikicon%' group by hour order by hour").collect.foreach(println)

// Webrequest Beacon Impression
spark.sql("SELECT hour, count(1) FROM wmf.webrequest where webrequest_source = 'text' and year=  2018 and month = 10 and day = 17 and uri_path = '/beacon/impression' and uri_query like '%ITwikicon%' group by hour order by hour").collect.foreach(println)

Results:

Turnilo/beacon/impressionEvents/beacon/event
5555
3333
7777
4444
5555
11111111
14141313
18181717
31313131
37373737
36363434
33333333
31313131
38383838
35353535
42424242
23232323
39393939
32323131
19191919
26262626
23232323
20202020
16161616
Nuria added a comment.Oct 18 2018, 8:42 PM

Some more context, using our data we can see that you are not sending as many beacon events (/beacon/event URL) with eventlogging as impressions are recorded via data in webrequest table (/beacon/impression URL), then:

FR tech needs to verify what browsers are receiving banners: are ALL browsers receiving banners ALSO capable of sending a EL events? If the answer to this question is yes you might be looking at bug in your EL code or (if difference is small), to browsers that do not support sendBeacon and thus, have navigated away from the page in which the banner is shown before the event is triggered. Seems that some quality checking of your EL data is in order to make sure you are not missing browsers, etc.

Assigning ticket to FR tech, moving to radar.

Nuria reassigned this task from JAllemandou to AndyRussG.Oct 18 2018, 8:42 PM
Nuria removed a project: Analytics-Kanban.
Nuria added a subscriber: JAllemandou.
Nuria added a subscriber: EBjune.Oct 18 2018, 8:46 PM
AndyRussG added a comment.EditedOct 19 2018, 2:45 PM

@Nuria, @JAllemandou, @mforns thanks so much taking the time to check this out, much appreciated!!! We can dig into it further (as priorities allow). :)

Nuria added a comment.Oct 19 2018, 2:59 PM

@AndyRussG

Can you answer these questions:

  • are all browsers receiving banners?
  • are only js-enabled browsers receiving banners?

Can you answer these questions

Hey... of course, apologies for the delays!!!

  • are all browsers receiving banners?
  • are only js-enabled browsers receiving banners?

Only JS-enabled browsers display CentralNotice banners. The system for selecting a banner to show to a given user relies on heavily on JavaScript. (See browser support matrix.)

@JAllemandou I think you're approach of comparing Hive tables directly is right on! I'll continue this approach, and look for possible causes in the client-side JS (again, as priorities allow). :)

Nuria added a comment.Oct 19 2018, 6:43 PM

@AndyRussG I would look at EL data and see if any browser is notably missing from the events you have sent and received (hive events table). If the browsers missing are notably the ones w/o support of SendBeacon then your issue is users navigating outside the page before events get sent, that means -by proxy- that they did not engaged with banner. If, however, there is no clear browser pattern to your missing data you probably have an issue with your instrumentation that you need to fix.

Nuria raised the priority of this task from High to Unbreak Now!.Nov 1 2018, 8:51 PM
Nuria moved this task from Data Quality to Radar on the Analytics board.
Restricted Application added subscribers: Pcoombe, Liuxinyu970226, TerraCodes. · View Herald TranscriptNov 1 2018, 8:51 PM
Nuria added a comment.Nov 1 2018, 8:56 PM

FYI that we are puting this on radar, this item is on FR court

DStrine lowered the priority of this task from Unbreak Now! to High.Nov 1 2018, 9:07 PM
DStrine added a subscriber: DStrine.

Thanks all for discussing this. I'm putting the priority back to "High" per our prioritization definitions: https://wikitech.wikimedia.org/wiki/Fundraising/Bug_Triaging#Prioritization

We have a couple of other critical tasks right now but this will be considered soon.

Quick update here. We are still using the older pipeline for this season so this is not affecting us at the moment. We'll try to look at this again next year.

DStrine lowered the priority of this task from High to Normal.Dec 18 2018, 8:47 PM
Nuria added a comment.Dec 18 2018, 8:49 PM

@DStrine can you clarify what is the old pipeline?

Kafkatee

@AndyRussG assures us this would not have impacted our campaign stats.