Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T183978 [Epic] Fundraising kafkatee changes
Open | None | | T186048 Adapt ingress of CN data into Druid to EventLogging-based impression recording
Event Timeline
(Note that Pivot is not working properly now for Banner impressions, due to some update somewhere, and new versions of Pivot no longer being FOSS. Replacement: https://superset.wikimedia.org/ . See also https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset .)
There are two fields in Druid that we can't currently get from the pure event data. One is country_matches_geocode, which indicates whether server-based geolocation gave the same result as the Geo cookie. (In a small number of cases, it doesn't.) The other is region.
We could get these by continuing to query the webrequest table in Hive instead of the standard Hive table created for our EL schema. But that would mean using a slightly odd, more complex Hive query to get the data.
As per Hangout discussion, instead we'll add region to the event data so we can get it via the normal route, and just remove country_matches_geocode from Druid. (If necessary, it'll still be possible to investigate Geo cookie issues directly via Hive.)
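To illustrate the difference between the two routes, here is a rough sketch of what a query against the schema's standard EventLogging Hive table might look like. (The table name, partition columns, and field names below are assumptions based on EventLogging's usual Hive layout, not verified against the actual schema.)

```sql
-- Sketch only: `event.centralnoticeimpression` and its columns are
-- assumptions following EventLogging's typical Hive conventions
-- (a per-schema table with an `event` struct and time partitions).
SELECT
  event.country,
  event.region,          -- the newly added client-side field
  COUNT(*) AS impressions
FROM event.centralnoticeimpression
WHERE year = 2018 AND month = 5 AND day = 1
GROUP BY event.country, event.region;
```

By contrast, getting the same numbers from webrequest means scanning all request rows and picking out impression-recording requests, which is what makes that query odder and more expensive.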
Change 430932 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[mediawiki/extensions/CentralNotice@master] Add geo region to client-side data and impression event
Schema updated, too: https://meta.wikimedia.org/wiki/Schema:CentralNoticeImpression
Change 430932 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Add geo region to client-side data and impression event
> There are two fields in Druid that we can't currently get from the pure event data. One is country_matches_geocode, which indicates whether server-based geolocation gave the same result as the Geo cookie. (In a small number of cases, it doesn't.) The other is region.
Oops! Looking at the EventLogging Hive table, I see there is actually a column with server-side geocoded data. So we can keep both of those.
Still, it'll be good to get region via the client and use that in Druid so that it lines up with country. If we need server-geocoded geo data (rather than the cookie-provided data from the client) we can always query that directly in Hive.
Following @mepps's suggestion, I asked around about performance for both routes... @Milimetric confirmed that querying our schema's EL Hive table is much preferable performance-wise to sifting through all of webrequest.
Finally, I looked into possible differences in delays in data availability. Here's what I found:
- Currently there's an experimental near-realtime streaming job giving us up-to-the-minute data in Druid! Woohoo! Since it pulls data via Kafka, it should be easy to adapt. :)
- As per the wikitech page, maintenance for the streaming job is not a high priority. However, if it were to break, I guess we'd go back to hourly loading via Hive. Fortunately, the EL Hive table is populated on a similar delay to the currently used Webrequest table.
So, tl;dr: no benefit to sticking with the webrequest table (or the webrequest Kafka topic for streaming), and fewer CPU cycles consumed on the Analytics cluster if we switch. And, realtime! Yay!
Thanks much!!!
Change 432405 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[analytics/refinery@master] [PLS. DON'T MERGE] Make banner activity Druid ingress from EventLogging
As mentioned on Gerrit, the patch uploaded is a rough attempt in need of review by those familiar with these systems.
Now checking out the realtime setup.
For the realtime job, are there any similar examples that have EventLogging sources that I could more or less copy?
The lines that mainly need to change are the ones for filtering and mapping, plus the specification of which stream to use as input.
Assuming those changes are correct, corresponding changes will then need to be made to the Hive query in the (uncommitted) daily job update.
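In case it helps the discussion, here is a minimal sketch of the kind of ingestion-spec fragment the streaming change amounts to: pointing the job at the EventLogging Kafka topic and mapping the event fields into Druid dimensions. (The topic name, datasource name, dimension list, and timestamp field below are all placeholders I'm assuming for illustration, not the actual job configuration.)

```json
{
  "ioConfig": {
    "type": "kafka",
    "topic": "eventlogging_CentralNoticeImpression",
    "consumerProperties": { "bootstrap.servers": "kafka-broker:9092" }
  },
  "dataSchema": {
    "dataSource": "banner_activity_minutely",
    "parser": {
      "parseSpec": {
        "format": "json",
        "timestampSpec": { "column": "dt", "format": "iso" },
        "dimensionsSpec": {
          "dimensions": ["country", "region", "campaign", "banner"]
        }
      }
    }
  }
}
```

The filtering and mapping steps mentioned above would live alongside this: dropping non-impression events and renaming EventLogging capsule fields to the dimension names Druid expects.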
@Ottomata, @JAllemandou, any guidance would be hugely appreciated! Thanks!!!!
It seems this work has already been completed here: https://phabricator.wikimedia.org/T203669
Please see: https://turnilo.wikimedia.org/#test_kafka_event_centralnoticeimpression
Analytics just needs to rename the dataset and document the job.