
Adapt ingress of CN data into Druid to EventLogging-based impression recording
Open, Needs Triage, Public

Description

Part of the replacement CentralNotice data pipeline.

This will also ensure data continues to be available in Superset and Turnilo (front-ends for Druid).

Event Timeline

(Note that Pivot is not currently working properly for banner impressions, due to an update somewhere and to new versions of Pivot no longer being FOSS. Replacement: https://superset.wikimedia.org/ . See also https://wikitech.wikimedia.org/wiki/Analytics/Systems/Superset .)

There are two fields in Druid that we can't currently get from the pure event data. One is country_matches_geocode, which indicates whether server-based geolocation gave the same result as the Geo cookie. (In a small number of cases, it doesn't.) The other is region.

We could get these by continuing to query the webrequest table in Hive instead of the standard Hive table created for our EL schema. But that would mean using a slightly odd, more complex Hive query to get the data.

As per Hangout discussion, instead we'll add region to the event data so we can get it via the normal route, and just remove country_matches_geocode from Druid. (If necessary, it'll still be possible to investigate Geo cookie issues directly via Hive.)
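
Roughly, that's the difference between these two queries (a spark-shell sketch; the beacon path, partition values, and exact table and field names here are my assumptions, not verified config):

```scala
// Route 1: sift through all of webrequest, picking out beacon hits and
// relying on server-side geocoding for country and region.
val fromWebrequest = spark.sql("""
  SELECT geocoded_data['country_code'] AS country,
         geocoded_data['subdivision']  AS region
  FROM wmf.webrequest
  WHERE year = 2018 AND month = 5 AND day = 10
    AND uri_path = '/beacon/impression'  -- assumed beacon path
""")

// Route 2: the standard Hive table auto-created for our EL schema; region
// becomes available here once the client sends it.
val fromEventLogging = spark.sql("""
  SELECT event.country, event.region
  FROM event.centralnoticeimpression
  WHERE year = 2018 AND month = 5 AND day = 10
""")
```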

Change 430932 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[mediawiki/extensions/CentralNotice@master] Add geo region to client-side data and impression event

https://gerrit.wikimedia.org/r/430932

Schema updated, too: https://meta.wikimedia.org/wiki/Schema:CentralNoticeImpression

Change 430932 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Add geo region to client-side data and impression event

https://gerrit.wikimedia.org/r/430932

> There are two fields in Druid that we can't currently get from the pure event data. One is country_matches_geocode, which indicates whether server-based geolocation gave the same result as the Geo cookie. (In a small number of cases, it doesn't.) The other is region.

Oops! Looking at the EventLogging Hive table, I see there is actually a server-side geocoded data column. So we can keep both of those.

Still, it'll be good to get region via the client and use that in Druid, so that it lines up with country. If we need server-geocoded data (rather than the cookie-provided data from the client), we can always query it directly in Hive.
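
So, if we ever do want to check cookie-vs-server geolocation from the normal route, something like this sketch should do it (assuming the refined EL table really exposes that geocoded_data map alongside the event struct; the field names are my guesses):

```scala
// Recreate country_matches_geocode on the fly: compare the Geo-cookie
// country the client sent with the server-side geocoded country.
val geoCheck = spark.sql("""
  SELECT event.country                                 AS cookie_country,
         geocoded_data['country_code']                 AS server_country,
         event.country = geocoded_data['country_code'] AS country_matches_geocode
  FROM event.centralnoticeimpression
  WHERE year = 2018 AND month = 5 AND day = 10
""")
geoCheck.where("NOT country_matches_geocode").show()  // just the mismatches
```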

Following @mepps's suggestion, I asked around about performance for both routes... @Milimetric confirmed that querying our schema's EL Hive table is much preferable, performance-wise, to sifting through all of webrequest.

Finally, I looked into possible differences in delays in data availability. Here's what I found:

  • Currently there's an experimental near-realtime streaming job giving us up-to-the-minute data in Druid! Woohoo! Since it pulls data via Kafka, it should be easy to adapt (see the sketch after this list). :)
  • As per the wikitech page, maintenance for the streaming job is not a high priority. However, if it were to break, I guess we'd go back to hourly loading via Hive. Fortunately, the EL Hive table is populated on a similar delay to the currently used webrequest table.
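
Here's the kind of input-source change I have in mind (a Spark Structured Streaming sketch only; the real job's plumbing differs, and the broker and topic names are my assumptions):

```scala
import org.apache.spark.sql.functions.{col, get_json_object}

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-jumbo1001.eqiad.wmnet:9092")  // assumed broker
  // Subscribe to the EventLogging topic for our schema instead of webrequest:
  .option("subscribe", "eventlogging_CentralNoticeImpression")  // assumed topic name
  .load()

// Each Kafka message value is a JSON EventLogging capsule, with the schema's
// fields nested under "event".
val impressions = raw
  .select(col("value").cast("string").as("json"))
  .select(
    get_json_object(col("json"), "$.event.banner").as("banner"),
    get_json_object(col("json"), "$.event.country").as("country"),
    get_json_object(col("json"), "$.event.region").as("region"))
```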

So, tl;dr: there are no benefits to sticking with the webrequest table (or the webrequest Kafka topic for streaming), and fewer CPU cycles consumed on the Analytics cluster if we switch. And, realtime! Yay!

Thanks much!!!

Change 432405 had a related patch set uploaded (by AndyRussG; owner: AndyRussG):
[analytics/refinery@master] [PLS. DON'T MERGE] Make banner activity Druid ingress from EventLogging

https://gerrit.wikimedia.org/r/432405

As mentioned on Gerrit, the patch uploaded is a rough attempt in need of review by those familiar with these systems.

Now checking out the realtime setup.

For the realtime job, are there any similar examples that have EventLogging sources that I could more or less copy?

The lines that mainly need to change are the filtering and mapping ones, and the specification of which stream to use as an input.
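
Concretely, continuing the streaming sketch from my earlier comment, those lines would look something like this (the dimension names here are guesses based on the existing banner activity datasource):

```scala
import org.apache.spark.sql.functions.col  // as in the earlier sketch

// Filtering: keep only events that carry the fields Druid groups on.
val filtered = impressions.filter(
  col("banner").isNotNull && col("country").isNotNull)

// Mapping: shape the remaining fields into the datasource's dimensions, so
// the Superset and Turnilo front-ends keep working unchanged.
val forDruid = filtered.select(
  col("banner"),
  col("country"),
  col("region"))  // the new dimension from the schema change above
```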

Assuming those changes are correct, corresponding updates will then need to be made to the Hive query in the (uncommitted) daily job update.

@Ottomata, @JAllemandou, any guidance would be hugely appreciated! Thanks!!!!

It seems to me this work was already completed here: https://phabricator.wikimedia.org/T203669

Please see: https://turnilo.wikimedia.org/#test_kafka_event_centralnoticeimpression
Analytics just needs to rename the dataset and document the job.

AndyRussG renamed this task from "Adapt Druid banner_activity jobs to EventLogging-based impression recording" to "Adapt ingress of CN data into Druid to EventLogging-based impression recording". May 22 2019, 4:16 PM
AndyRussG updated the task description.