I was getting to grips with running queries on the Hive cluster via HUE.
To ground truth the queries I was running I was comparing with the druid data shown in turnilo.
Using filtering in turnilo for only a time period "Sept 14 2100-2200 UTC", campaign "wlm+2018" and country "GB" yielded total impressions of 15970
What I thought would be the exact same query in HUE:
SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year='2018' AND month='9' AND day='14' AND hour='21') AND (event.campaign="wlm 2018") AND (event.country="GB")
Returns an impression count of 15348
I tried using the timestamp with the following query:
SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year="2018") AND (event.campaign="wlm 2018") AND (dt LIKE '2018-09-14T21%') AND (event.country="GB")
returns: 15348
I then tried
SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year="2018" AND month="9" AND day="14" AND hour="21") AND (event.campaign="wlm 2018") AND (geocoded_data["country_code"] = "GB")
returns 15325
SELECT COUNT(*) AS NumberOfImpressions FROM event.centralnoticeimpression WHERE (year="2018" AND month="9" AND day="14" AND hour="21") AND (event.campaign="wlm 2018") AND (geocoded_data["country"] = "United Kingdom")
returns 15325
I cant check to see if there are missing entries in the year/month/day/hour columns since there is a predicate on the table forcing me to use one of those.